(54) A method of address compression for cell-based and packet-based protocols and hardware implementations thereof

(57) An algorithm is disclosed that is able to compress a defined set of addresses S, the set of addresses to be compressed, belonging to the set U, the whole addressing space; for each of these addresses the algorithm must identify one and only one address belonging to C, the set of compressed addresses (i.e. perform a transformation S → C). The algorithm may be implemented using some low-cost random access memories (RAM) and some control logic. A performance comparison shows that it is possible to perform the address compression using one order of magnitude less memory with respect to the state-of-the-art techniques.

Basically, the method of the invention combines the splitting of the incoming address space (U) into a plurality of sub-spaces, a tree search algorithm for clustering a defined set (S) of identifiers contained in the sub-spaces into which the incoming address space (U) has been split, and a sequential search performed within the right cluster in order to identify the compressed address belonging to space C.

The patent covers the algorithm, a preferred embodiment and some extended embodiments that give extra gain.

Thanks to the invention, it is thus possible to implement silicon devices able to compress one order of magnitude more managed channels with respect to the state-of-the-art techniques, without area changes.

Conversely, it is possible to implement the address compression function with one order of magnitude less memory resources with respect to the state-of-the-art techniques.
Description

1 BACKGROUND OF THE INVENTION 

1.1 ADDRESS COMPRESSION PROBLEM DEFINITION

[0001] For each communication protocol, an Incoming Address Space (the maximum number of channels a specific protocol can handle) is defined. In this document, the addressing space of $n$ bits ($2^n$ addresses) is referred to as the set of Incoming Addresses.
[0002] On the other hand, telecom equipment is able to deal only with a limited number of managed channels. The number of simultaneously manageable channels is finite and is a typical design target. Each managed channel must be addressable by means of an internal identifier, which is a subset of the Incoming Address. In this document, the space of $n_{cpr}$-bit internal identifiers ($2^{n_{cpr}}$ values) is referred to as the set of Compressed Addresses.

[0003] In telecom equipment, a function must be implemented that maps some points belonging to the universe of Incoming Addresses ($2^n$ values) to a set of Compressed Identifiers ($2^{n_{cpr}}$ values). This function is called the Address Compression Function.

[0004] For network management reasons, the Incoming Address Space is very large. At the same time, the number of channels that telecommunication apparatuses must nowadays manage simultaneously is also very large. Moreover, data link speed is increasing at an impressive pace: in ten years from 64 Kbit/s to 155 Mbit/s, and now to 1.2 Gbit/s.

[0005] Because of this, the efficiency of the design of the Address Compression Function is today a key factor in equipment such as routers and switches. Design has become critical because, with the increased data speed, the time available to perform the Address Compression Function is reduced. At the same time, the increasing number of manageable channels raises costs, because of the increasing number of resources needed to perform the Address Compression Function.



1.1.1 ADDRESS COMPRESSION PROBLEM DEFINITION 



[0006] The aim of the algorithm is to compress a defined set of addresses S, the set of addresses to be compressed, belonging to U, the whole addressing space, as shown in Figure 1. For each of these addresses the algorithm must identify one and only one address belonging to C, the set of compressed addresses (i.e. perform a transformation S → C).



$n$: dimension of the whole addressing space ($U = \{a_0, \ldots, a_{2^n - 1}\}$);

$n_{cpr}$: dimension of the space of compressed addresses ($C = \{c_0, \ldots, c_{2^{n_{cpr}} - 1}\}$);

where $n_{cpr} < n$ and $C \subset U$.

[0007] The cardinality of S must equal the cardinality of C.

1.1.2 ADDRESS COMPRESSION FUNCTION AND IP APPLICATION 


[0008] The most fundamental operation in any IP routing product is the Routing Table search process. 
[0009] A packet is received with a specific Destination Address (DA), identified by a unique 32-bit field in current IP Version 4 implementations. The router must search a forwarding table using the IP Destination Address as its key and determine which entry in the table represents the best route for the packet to take in its journey across the network to its destination.

[0010] A "flat" forwarding table would have a size of $2^{32}$ addresses, that is 4 Gbytes of address space (16 Gbytes of data). The DA must be compressed to point to a table of reasonable size.
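The two figures in [0010] are related by the size of a table entry. The text does not state it, but 4 bytes per entry is what the ratio of 16 Gbytes to 4 G addresses implies:

```latex
2^{32}\ \text{entries} = 4\,294\,967\,296 \approx 4\ \text{G entries},
\qquad
2^{32}\ \text{entries} \times 4\ \tfrac{\text{bytes}}{\text{entry}} = 2^{34}\ \text{bytes} = 16\ \text{GiB}.
```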

[0011] The route search operation is the single most time-consuming operation that must be performed in routers today, and it typically defines the upper bound on the router's ability to forward packets.
[0012] The problem has grown even more challenging in recent years.

[0013] Data links now operate routinely at 100 Mbit/s and generate nearly 150,000 packets per second requiring routing.

[0014] New protocols, such as RSVP, require route selection based not only on the Destination Address, but potentially also on the Protocol Number, Source Address, Destination Port and Source Port.

[0015] IP Version 6 will increase the size of the address field from 32 bits to 128 bits, with network prefixes up to 64 bits in length. Expanded use of IP Multicasting requires that searches include large numbers of Class D (Multicast Group) addresses with large numbers of users.

[0016] Moreover, the ever-expanding number of networks and hosts on the Internet is making routing tables larger and larger.

1.1.3 ADDRESS COMPRESSION FUNCTION AND ATM APPLICATIONS 

[0017] To be compliant with ITU and ATM Forum specifications, ATM data equipment must be able to receive ATM cells for any admissible value of the header fields VPI, VCI. The total length of these fields is 24 bits (about 16.7 million admissible values).

[0018] On the other hand, ATM equipment is designed to manage a number of internal channels (at least) equal to the maximum number of engageable channels. This number depends on the application: from one to hundreds in the case of terminals; some thousands (4K, 64K) in the case of core network equipment.

[0019] In the following description, the univocal (shorter) internal channel identifier will be referred to as the Channel Identifier (CID).

[0020] Clearly, the processing must be able to map any possible value of VPI/VCI (24 bits) to any possible CID (e.g. 12 bits).
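For illustration, the 24-bit incoming address can be formed by concatenating the two header fields. This minimal sketch assumes the UNI cell header format (8-bit VPI, 16-bit VCI); at the NNI the VPI is 12 bits wide and the key would grow to 28 bits. The function and constant names are hypothetical.

```python
VPI_BITS = 8   # VPI width at the UNI (assumption: UNI cell header)
VCI_BITS = 16  # VCI width

def incoming_address(vpi: int, vci: int) -> int:
    """Concatenate VPI and VCI into one 24-bit incoming address (VPI in the MSBs)."""
    if not (0 <= vpi < 1 << VPI_BITS and 0 <= vci < 1 << VCI_BITS):
        raise ValueError("VPI/VCI out of range")
    return (vpi << VCI_BITS) | vci

N = VPI_BITS + VCI_BITS    # width of the incoming address space U
N_CPR = 12                 # example CID width from [0020]
print(1 << N, 1 << N_CPR)  # 16777216 4096
```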

1.2 ALGORITHM CLASSES 

[0021] A compression function able to map from a string of length N bits to a (unique) string of length Ncpr (Ncpr<N) 
can be implemented in various ways. 

[0022] Two main classes exist: the algorithms with an unpredictable duration belong to a first class; the others, with 
a predictable duration, belong to the second one. 

[0023] For the algorithms of the first class, it is not possible to know for how long (in microprocessor instructions or clock cycles) the algorithm will run before hitting the compressed identifier; this depends on the number of active connections. These algorithms are normally easier to implement, do not require many resources and can be sped up only by improving the access time of the RAMs where the search tables are located.

[0024] For the algorithms of the second class, (predictable duration algorithms) it is possible to know, UNDER ANY 
CONDITION, how much time (microprocessor instructions or clock cycles) the algorithm will run before hitting the com- 
pressed identifier. These algorithms often require a lot of resources. 

[0025] An algorithm belonging to the second class ensures that the maximum search time is less than the time needed to receive the shortest packet¹; this guarantees the maximum allowable throughput of the equipment.
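The requirement of [0025] can be turned into a per-packet cycle budget. The sketch below assumes a hypothetical 100 MHz clock for the compression logic and ignores framing overhead, which only makes the estimate conservative. The packet sizes are those of note 1, and 155.52 Mbit/s is the STM-1 line rate usually quoted as 155 Mbit/s.

```python
def cycle_budget(packet_bytes: int, line_rate_bps: int, clock_hz: int) -> int:
    """Whole clock cycles available to the ACF while the shortest packet arrives.

    Integer arithmetic avoids rounding surprises; framing overhead is ignored,
    so the real budget is slightly larger."""
    packet_bits = 8 * packet_bytes
    return packet_bits * clock_hz // line_rate_bps

CLOCK_HZ = 100_000_000  # hypothetical 100 MHz ACF clock
print(cycle_budget(53, 155_520_000, CLOCK_HZ))    # 272: ATM cell at 155.52 Mbit/s
print(cycle_budget(53, 1_200_000_000, CLOCK_HZ))  # 35: ATM cell at 1.2 Gbit/s
```

Under these assumptions, a predictable-duration algorithm at 1.2 Gbit/s must finish within about 35 cycles per cell.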

1.2.1 UNPREDICTABLE DURATION ALGORITHMS 

[0026] IP router companies developed the algorithms belonging to this class some years ago; they can be called "classical route search techniques". The main algorithms are explained below in an IP context to provide the reader with useful historical background.

1.2.1.1 THE PATRICIA TREE 

[0027] This is the most popular algorithm used in router "slow paths". The forwarding table (associating each prefix entry with an exit port and next-hop MAC address) is stored in a "Binary Root Tree" form.

[0028] The table is organized in a series of "nodes", each of which contains a route of different length, and each of 
which has two "branches" to subsequent nodes in the tree. At the ends of the branches there are "leaves", which either
represent full 32-bit host routes (for devices attached directly to the router) or most-specific routes available to a partic- 
ular subnet. 

[0029] The algorithm is able to map ANY incoming vector to a unique outcoming identifier. Unfortunately, in the worst case the algorithm has to travel all the way to the end of the tree to find a leaf, so the time needed is not predictable.

[0030] The Patricia Tree approach does not scale well to level-2 packet switching: a worst-case lookup involves a large 
number of memory accesses, taking far more time than that available at gigabit rates. Moreover, hardware implemen- 
tation is rather complex. This algorithm was developed for general-purpose software implementations. 
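The unpredictable lookup time can be seen even in a plain binary trie with longest-prefix match, sketched below. This is a simplification, not a full Patricia tree: the path compression that distinguishes Patricia is omitted, and the class names and example routes are hypothetical. The number of nodes visited, and hence of memory accesses, depends on the address searched.

```python
from __future__ import annotations

class Node:
    __slots__ = ("children", "route")

    def __init__(self) -> None:
        self.children: list[Node | None] = [None, None]
        self.route: str | None = None  # exit port stored at this prefix, if any

class BinaryTrie:
    """Uncompressed binary trie over `width`-bit addresses (no Patricia path compression)."""

    def __init__(self, width: int = 32) -> None:
        self.width = width
        self.root = Node()

    def _bit(self, addr: int, i: int) -> int:
        return (addr >> (self.width - 1 - i)) & 1  # i-th bit counted from the MSB

    def insert(self, prefix: int, length: int, route: str) -> None:
        node = self.root
        for i in range(length):
            b = self._bit(prefix, i)
            if node.children[b] is None:
                node.children[b] = Node()
            node = node.children[b]
        node.route = route

    def lookup(self, addr: int) -> tuple[str | None, int]:
        """Longest-prefix match; also return how many nodes were visited."""
        node, best, visited = self.root, self.root.route, 0
        for i in range(self.width):
            node = node.children[self._bit(addr, i)]
            if node is None:
                break
            visited += 1
            if node.route is not None:
                best = node.route
        return best, visited

trie = BinaryTrie()
trie.insert(0x0A000000, 8, "port1")   # 10.0.0.0/8
trie.insert(0x0A010000, 16, "port2")  # 10.1.0.0/16
print(trie.lookup(0x0A010203))  # ('port2', 16)
print(trie.lookup(0x0A020000))  # ('port1', 14)
print(trie.lookup(0x0B000000))  # (None, 7)
```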

¹ 64 bytes for IP, 53 bytes for ATM.
1.2.1.2 HASHING TABLES

[0031] "Hashing" is an alternative approach. Unlike the Patricia Tree, hashing operates strictly on an exact-match basis, and assumes that the number of "channels" (IP Destination Addresses, VPI/VCI) the system must handle at any one time is limited to a few thousand.

[0032] A "hash" function, a sort of compression algorithm, is used to condense each incoming identifier (24 or 32 bits) in the table to a smaller-sized entry (typically 8 to 10 bits).

[0033] When a packet is received, an equivalent "hash value" is computed quickly from its incoming identifier. This value points to a hash table entry (named a "slot") that corresponds to one or more outcoming identifiers. The compression effected by the hashing function makes the table small enough to be searched quickly and sequentially using simple hardware-based exact-matching techniques.

[0034] The main problem of the hashing technique is that it assumes a "flat" distribution of the values of the incoming identifiers. The "hash" function maps the space of possible incoming identifier values onto a plurality of sub-spaces.

[0035] In Figure 2, the ellipse indicates the U space, and the valid incoming identifiers, that is the S space, are indicated as tiny circles. The "hash" function generates the boundaries between sub-spaces. If, as depicted in Figure 3, a number of identifiers greater than the slot size (hash table) must be mapped in a sub-space, the hash function must be recalculated appropriately.

[0036] This involves re-sorting the items in the hash tables, which cannot be performed in real time.
[0037] This process is easy to implement in hardware and tends to perform fairly well, albeit in a probabilistic manner.

[0038] Unfortunately, this algorithm has a number of drawbacks. In a hardware implementation it is not possible to change the "hash" function "on the fly", because a full re-sorting of the items is implied. This means that the only way to overcome the problem is to increase the slot length, but obviously this is not always possible.
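The slot-overflow problem of [0035] and [0038] can be sketched with a fixed-size hash table in which each slot holds at most `slot_size` identifiers. An insertion that would overflow a slot fails, and the only remedies are choosing a new hash function (which re-sorts every entry) or lengthening the slots. The multiplicative hash, the class name and the sizes are hypothetical choices for illustration, not those of any product.

```python
from __future__ import annotations

class SlottedHashTable:
    """Exact-match hash table with 2**slot_bits slots of slot_size entries each."""

    def __init__(self, slot_bits: int, slot_size: int,
                 multiplier: int = 0x9E3779B1) -> None:
        self.slot_bits = slot_bits
        self.slot_size = slot_size
        self.multiplier = multiplier  # changing it means re-sorting every entry
        self.slots: list[list[tuple[int, int]]] = [[] for _ in range(1 << slot_bits)]

    def _hash(self, key: int) -> int:
        # Multiplicative hashing: top slot_bits bits of the low 32 bits of key * multiplier.
        return ((key * self.multiplier) & 0xFFFFFFFF) >> (32 - self.slot_bits)

    def insert(self, key: int, out_id: int) -> bool:
        slot = self.slots[self._hash(key)]
        if len(slot) >= self.slot_size:
            return False  # slot overflow: the situation described for Figure 3
        slot.append((key, out_id))
        return True

    def lookup(self, key: int) -> tuple[int | None, int]:
        """Return (outcoming identifier or None, entries compared in the slot)."""
        compared = 0
        for stored, out_id in self.slots[self._hash(key)]:
            compared += 1
            if stored == key:
                return out_id, compared
        return None, compared

table = SlottedHashTable(slot_bits=2, slot_size=2)  # 4 slots x 2 entries = 8 positions
accepted = [k for k in range(0, 900, 100) if table.insert(k, k // 100)]
print(len(accepted) < 9)  # True: nine keys cannot all fit in eight positions
```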
[0039] The main ATM IC developers (Motorola, IDT, TranSwitch) have implemented an algorithm of this kind. A typical architecture is shown in Figure 4.

[0040] A main problem is that the processing time of the incoming identifier is not deterministic (in some cases a sequential search is needed) and may eventually become longer than one packet (cell) time.

[0041] The ACF function is implemented by means of several readings in the hash tables, which are written by the controlling microprocessor in an "off-line" manner.

[0042] The algorithm implies the subtle assumption that the sequence of incoming identifiers is "spread" over the entire set of sub-spaces and that in every sub-space the average search time is shorter than the packet (cell) time.
[0043] Moreover, a rather long FIFO (10 to 20 packet/cell positions) is required to decouple the incoming rate from the speed of the compression algorithm, the two being equal only on average.

[0044] In some cases, the packet (cell) may be lost or misrouted. The only way to cure this problem is to increase the speed of the hash table².

[0045] IC chip providers prefer this architecture because it is cheaper, but it is unable to support the mapping of any possible incoming identifier to local identifiers.

[0046] In the present context, different expressions are sometimes used to indicate materially the same thing. In particular, the same N-bit string or the same Ncpr-bit string is often referred to as a physical layer identifier, a virtual path identifier, an address or a vector. These expressions are commonly used and perfectly understood by technicians, and the different expressions are often used when describing an algorithm or a data processing structure, and so forth.

1.2.2 PREDICTABLE DURATION ALGORITHMS

[0047] In predictable duration algorithms, the ACF is performed under any condition in a time less than or equal to the packet time (cell period). A typical architecture is shown in Figure 5.

[0048] Because the algorithm duration is known to be shorter than a packet (cell) cycle, ANY type of incoming traffic can be admitted. On the other hand, more chip or system resources are needed to implement the function than would be required by an algorithm of unpredictable duration.

[0049] There are three well-known techniques that are able to perform the ACF predictably in less than one packet (cell) time:

• CAM
• Sequential search
• Binary tree

² For example, the Motorola ATMC devices need 10 ns hash memories.
1.2.2.1 CAM

[0050] According to this approach, the incoming address (e.g. VPI/VCI) is input to a Content Addressable Memory (CAM). The CAM hits the correct compressed identifier. If there is no hit, the cell is discarded.
[0051] The CAM is as wide as the incoming address and deep enough to accommodate the maximum number of connections.

[0052] The execution time of the ACF is typically a few clock cycles and in any case less than a cell time. The main problem of this architecture is the availability of the CAM module³.

1.2.2.2 SEQUENTIAL SEARCH

[0053] To obtain a compressed identifier from an incoming address, it is possible to perform a sequential search on a RAM for a number of cycles less than or equal to the packet (cell) time. A relatively small RAM, a counter to generate addresses and a single 24-bit comparator are all that is needed, as depicted in Figure 6.
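A minimal model of the structure of Figure 6 follows: a counter scans the RAM one word per clock cycle, and a single comparator checks each word against the incoming address; the RAM address of the hit serves as the compressed identifier. The function name, the `None` marker for empty words and the cycle-budget check are assumptions of this sketch.

```python
from typing import Optional, Sequence

def sequential_search(ram: Sequence[Optional[int]], incoming: int,
                      cycle_budget: int) -> Optional[int]:
    """Return the RAM address holding `incoming`, used as its compressed identifier.

    One word is read and compared per clock cycle, so the RAM must not be deeper
    than the cycles available per packet (cell); otherwise the worst case is late."""
    if len(ram) > cycle_budget:
        raise ValueError("RAM deeper than the per-packet cycle budget")
    for address, word in enumerate(ram):  # counter generating the RAM addresses
        if word == incoming:              # the single comparator
            return address
    return None                           # no hit: address not configured

ram = [0x000120, 0x00FF01, None, 0x123456]  # None marks an unused word
print(sequential_search(ram, 0x123456, cycle_budget=272))  # 3
```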






1.2.2.3 EXTENDED SEQUENTIAL SEARCH



[0054] To increase the extent of the sequential search without exceeding the number of available clock cycles, it is possible to use several RAMs, several counters to generate the addresses, several 24-bit comparators and a priority encoder, as depicted in Figure 7.
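The extended variant can be sketched by scanning several equally deep banks in lock-step, one comparator per bank, with a priority encoder selecting the lowest-numbered bank that hits. Forming the compressed identifier as `bank_index * depth + address` is one plausible encoding chosen for this sketch, not necessarily the one used in Figure 7.

```python
from typing import Optional, Sequence

def extended_sequential_search(banks: Sequence[Sequence[Optional[int]]],
                               incoming: int) -> Optional[int]:
    """Search several equally deep RAM banks in lock-step.

    Each loop iteration models one clock cycle: a shared counter addresses the
    same row in every bank, one comparator per bank checks it, and a priority
    encoder reports the lowest-numbered bank that hits."""
    depth = len(banks[0])
    if any(len(bank) != depth for bank in banks):
        raise ValueError("all banks must have the same depth")
    for address in range(depth):                              # shared counter
        hits = [bank[address] == incoming for bank in banks]  # parallel comparators
        for bank_index, hit in enumerate(hits):               # priority encoder
            if hit:
                return bank_index * depth + address           # hypothetical encoding
    return None

banks = [[0x10, 0x11], [0x20, 0x21], [0x30, 0x31]]  # 6 entries, 2 cycles worst case
print(extended_sequential_search(banks, 0x21))  # 3
print(extended_sequential_search(banks, 0x30))  # 4
```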

1.2.2.4 BINARY TREE 

[0055] The mapping from the valid incoming vectors to the compressed identifier is implemented by means of a chain of memories.
[0056] A pointer chain has to be written in these memories in order to link each valid incoming vector with the right compressed identifier.

[0057] The first memory is addressed by a bit slice of the incoming address (typically the most significant bits). Its content is a pointer to the second memory.
[0058] The second memory is addressed by the pointer obtained from the first one, concatenated with a new slice of the incoming vector. Its content is a pointer to the third memory.

[0059] The third memory is addressed by the pointer obtained from the second one, concatenated with another slice of the incoming vector. The chain ends when all the bits of the incoming address have been used.
[0060] To ensure zero blocking probability, the width of every memory has to be equal to Ncpr.

[0061] Unfortunately, because of this, memory utilization is really poor (around 5 to 10%).
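The chain of memories of [0057]-[0060] can be modelled as a multi-level table walk: the first memory is indexed by the most significant slice, and each later memory by the previous pointer together with the next slice. This sketch assumes equal-width slices, Ncpr-bit pointers, and one Python dictionary per memory; all names are hypothetical. With n = 24 and 6-bit slices it uses four memories and four reads, matching the four addressing cycles of [0065], although the actual slicing of Figure 10 is not known here.

```python
from __future__ import annotations

class ChainedTables:
    """Binary tree search as a chain of memories, one memory per address slice."""

    def __init__(self, n: int, slice_bits: int, n_cpr: int) -> None:
        if n % slice_bits:
            raise ValueError("n must be a multiple of slice_bits")
        self.n, self.slice_bits, self.n_cpr = n, slice_bits, n_cpr
        self.levels = n // slice_bits
        # memories[l]: (pointer from memory l-1, slice l) -> stored Ncpr-bit word
        self.memories: list[dict[tuple[int, int], int]] = [{} for _ in range(self.levels)]
        self.pages_used = [0] * self.levels  # pages allocated so far in each memory

    def _slices(self, addr: int) -> list[int]:
        """Split addr into equal slices, most significant first."""
        mask = (1 << self.slice_bits) - 1
        return [(addr >> (self.n - (lvl + 1) * self.slice_bits)) & mask
                for lvl in range(self.levels)]

    def configure(self, addr: int, cid: int) -> None:
        """Write the pointer chain linking addr to the compressed identifier cid."""
        if not 0 <= cid < 1 << self.n_cpr:
            raise ValueError("cid does not fit in n_cpr bits")
        ptr = 0  # the first memory has a single page
        for level, sl in enumerate(self._slices(addr)):
            key = (ptr, sl)
            if level == self.levels - 1:
                self.memories[level][key] = cid    # last memory stores the CID
                return
            if key not in self.memories[level]:
                page = self.pages_used[level + 1]  # allocate a fresh page below
                if page >= 1 << self.n_cpr:
                    raise MemoryError("pointer does not fit in n_cpr bits")
                self.pages_used[level + 1] += 1
                self.memories[level][key] = page
            ptr = self.memories[level][key]

    def lookup(self, addr: int) -> int | None:
        """At most one read per memory: `levels` reads for a configured address."""
        ptr = 0
        for level, sl in enumerate(self._slices(addr)):
            word = self.memories[level].get((ptr, sl))
            if word is None:
                return None                        # broken chain: not configured
            ptr = word
        return ptr                                 # read from the last memory: the CID

tables = ChainedTables(n=24, slice_bits=6, n_cpr=12)  # 4 memories, 4 reads per lookup
tables.configure(0x123456, 7)
tables.configure(0x123457, 8)
print(tables.lookup(0x123456), tables.lookup(0x123457), tables.lookup(0x654321))  # 7 8 None
```

A hardware memory must reserve 2**slice_bits rows for each of its up to 2**Ncpr pages whether or not they are used; the dictionaries above store only written entries, so they hide the low utilization noted in [0061].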

[0062] Figure 8 shows the organization of the memories needed for implementing a Binary Tree Search.
[0063] In Figure 9 the ellipse depicts the U space, and the set of valid incoming identifiers (the S space) is indicated by the tiny circles. The Binary Tree technique splits the U space into areas of equivalent size by means of a direct addressing table (DAT); the sub-spaces are then split again by means of RTi, in order to ensure that no more than one point belonging to S is present in a particular sub-space.

[0064] Figure 10 shows a typical implementation for ATM, in which 24-bit words of incoming VPI/VCI must be converted to proper channel identifiers (CID), 12 bits wide. The basic assumption is to implement a search path over some external RAM banks, addressed by means of the VPI/VCI fields.

[0065] Four banks (ATM Compression Blocks) of RAM, for a total of 392 Kbytes, are addressed in order to have up to 4096 different CIDs. Four addressing cycles are needed. The dimensions of the memories depend on the maximum number of CIDs needed.

[0066] US Patent No. 5,414,701 describes a method and a structure for performing address compression in an ATM system based on a so-called content addressable memory (CAM), as described above.

[0067] Given the requirement of mapping incoming N-bit identifiers into Ncpr-bit virtual path identifiers within a cell time slot, a data processing structure performing such an address compression function according to any of the known approaches reviewed above requires relatively large physical resources in terms of RAM.

[0068] Irrespective of the approach followed, the RAM required for reliable operation of the data processing structure employed to perform address compression is a crucial cost factor, and there is a clear opportunity to find methods of performing address compression that are more efficient than those presently known and that may be realized at a reduced cost.

³ One component on the market implements the ACF in this manner: the Fujitsu MB86689 Address Translation Controller (ATC).



2 OBJECT AND SUMMARY OF THE INVENTION
[0069] A method of address compression has now been found that is outstandingly more efficient than the known methods, capable of reducing the RAM requirement for comparable performance in terms of the number of clock cycles necessary to complete the compression algorithm.

[0070] Moreover, when the data processing structure of the invention is optimized, its performance in terms of the two parameters of memory requirement and number of clock cycles required is significantly better than the performance obtainable from any of the systems realized according to the known approaches.
[0071] These important advantages are achieved, according to the present invention, by a method that combines certain aspects of an unpredictable duration algorithm with those of a classical sequential search algorithm. The synergistic combination of the different approaches produces the reported outstanding performance.

[0072] Basically, the method of the invention combines the splitting of the incoming address space (U) into a plurality of sub-spaces with a tree search algorithm for clustering a defined set (S) of identifiers contained in the sub-spaces into which the incoming address space (U) has been split.

[0073] Having so clustered the elements of the defined set (S) of identifiers, a sequential search is performed within 
each cluster so constructed for identifying the Ncpr-bit identifier belonging to the compressed address space (C). 
[0074] Restricting the sequential search to a pre-identified cluster of known size ensures identification within a given number of clock cycles (a predictable time span). The system may be further optimized either to reduce the number of clock cycles required by the sequential search or to reduce the memory requirement.
[0075] The method of the invention is more precisely defined in the independent claims 1 and 6 for an unclassified address space, with preferred embodiments defined in claims 2 and 5, while the data processing structure of the invention that implements the method is defined in the appended claims 7 and 12 for a classified address space, with preferred embodiments in claims 8 to 11.
[0076]

Figure 1 - Representation of address compression problem

Figure 2 - Example of "hit" distribution

Figure 3 - Recompilation of "hash" function

Figure 4 - Typical unpredictable duration class implementation

Figure 5 - Typical predictable duration class implementation

Figure 6 - Sequential Search structure

Figure 7 - Extended Sequential Search structure

Figure 8 - Binary Tree search structure

Figure 9 - U space splitting via Binary Tree technique

Figure 10 - Channel compression block data structure in ATM environment

Figure 11 - U space splitting via the CSSA technique of the invention

Figure 12 - Block diagram of a CSSA system of the invention

Figure 13 - Layout of DAT, RTi and SST blocks

Figure 14 - Example 1 of CSSA operation

Figure 15 - Example 2 of CSSA operation

Figure 16 - Alternative embodiments of the system of the invention

Figure 17 - Extended CSSA #1 - Sequential Search Table with different SSTi

Figure 18 - Extended CSSA #2 - Sequential Search Table with a single, wide SST

Figure 19 - Extended CSSA #3 - pipelined FIFOs, staged architecture

Figure 20 - Problem representation example for Extended CSSA #4

Figure 21 - Extended CSSA #4 architecture

Figure 22 - Implementation example for Extended CSSA #4 

Figure 23 - Generic Address Compression Function 

Figure 24 - Performance evaluation method 

Figure 25 - Pure sequential search structure 

Figure 26 - Extended sequential search structure 

Figure 27 - Binary Tree search structure 

Figure 28 - Clustered Sequential search structure 
3 DESCRIPTION OF AN EMBODIMENT OF THE INVENTION

3.1 THE CLUSTERED SEQUENTIAL SEARCH ALGORITHM (CSSA) OF THE INVENTION 

[0077] The novel CSSA technique of the invention splits the U space into areas of equivalent size by means of a DAT, which is preferably made as small as possible; the sub-spaces are then split again by means of a cascade of RTi, in order to ensure that no more than SSLL points belonging to S are present in a particular sub-space. In the example shown in Figure 11, SSLL is set to 4.

[0078] To identify the points belonging to S within the addressed sub-spaces, a sequential search is performed by means of an SST (Sequential Search Table).

[0079] The following paragraphs explain in detail the algorithm. 

3.1.1 CSSA DESCRIPTION 

[0080] The proposed algorithm combines both clustering of space and sequential search. 

[0081] The set S is split into clusters, and a sequential search is performed within each cluster. More precisely, CSSA is performed in three main steps:

1. splitting of U into equal subspaces (each subspace can contain the whole of S, some elements of S, or no element of S);

2. clustering of S (the elements of S are divided among a set of clusters);

3. sequential search within each cluster.

[0082] Splitting is performed in the Direct Addressing Table (DAT), the clustering phase is performed in a cascade of Routing Tables (RTi), and the linear search is performed in the Sequential Search Table (SST). This structure is illustrated in Figure 12.
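The first two steps consume successive bit slices of the incoming address, as defined by the functions dat_rowsel and rt_rowsel given with the Configuration Mode pseudo-code: the DAT row is formed by the WL_DAT most significant bits, and the row in RTi by the next WL_RTi bits. The sketch below reproduces that slicing with the table widths passed as parameters; the widths used (n = 24, WL_DAT = 8, two RTs of 4 bits) are hypothetical example values.

```python
from __future__ import annotations

def bit_slice(s: int, hi: int, lo: int) -> int:
    """Bits hi..lo of s (inclusive), written s(hi, lo) in the pseudo-code."""
    return (s >> lo) & ((1 << (hi - lo + 1)) - 1)

def dat_rowsel(s: int, n: int, wl_dat: int) -> int:
    """Step 1: the WL_DAT most significant bits select one of 2**WL_DAT subspaces."""
    return bit_slice(s, n - 1, n - wl_dat)

def rt_rowsel(i: int, s: int, n: int, wl_dat: int, wl_rt: list[int]) -> int:
    """Step 2: row of RTi (i counted from 1), the WL_RTi bits below those already used."""
    consumed = wl_dat + sum(wl_rt[: i - 1])
    return bit_slice(s, n - consumed - 1, n - consumed - wl_rt[i - 1])

s = 0xABCDEF  # a 24-bit incoming address
print(hex(dat_rowsel(s, 24, 8)))            # 0xab
print(hex(rt_rowsel(1, s, 24, 8, [4, 4])))  # 0xc
print(hex(rt_rowsel(2, s, 24, 8, [4, 4])))  # 0xd
```

In this sketch the low-order bits not consumed by the DAT and the RTs (here 0xEF) are resolved only by the full-address comparison in the SST, which is one reason several addresses can share a cluster.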

[0083] As depicted in Figure 12, the structure may feed an optional translation table (TT), according to a common technique.

[0084] The listed tables have the layout shown in Figure 13. The fields belonging to each table are described in the following boxes.

RTi (i-th Routing Table)

RTi[i].PTR: pointer to a selected page of RT(i+1) (if RTi[i].PTR = k, page k of RT(i+1), i.e. RT(i+1)[k], is pointed to);

RTi[i].USED: number of times a pointer is used (if RTi[i].USED = m, the link RTi[i].PTR is routed m times);

PGL_RTi: page length of RTi (PGL_RTi is a power of 2, pgl_RTi = log2 PGL_RTi);

N_PG,RTi: number of pages of RTi;

SST (Sequential Search Table)

N_cluster: number of clusters;

SSLL: length of each sequential search list (Sequential Search List Length);

WL_RTi: word length of RTi (expressed in bits);

[0085] The CSSA structure has three different modes of operation:
[0086]

• Initialization Mode;

• Configuration Mode;

• Normal Operation Mode.



[0087] In the Initialization Mode the contents of DAT, RTi, SST and SSTPF are initialized with default values. In the Configuration Mode the contents of DAT, RTi and SST have to be set to values suitable for compressing a defined set S of addresses. In the Normal Operation Mode, for each incoming vector (INCVECT) fed to it, the algorithm finds a corresponding outcoming vector (OUTVECT) that matches the INCVECT.
3.1.2 INITIALIZATION MODE

[0088] In the Initialization Mode the contents of DAT, RTi, SST and SSTPF are initialized with default values:

• USED fields are initialized with 0;

• PTR fields are initialized with UNASSIGNED;

• ADDR fields are initialized with UNASSIGNED.

[0089] The pseudo-code for the Initialization Mode is: 



/* DAT initialization */
FOR i = 1 TO N_subspace
    DAT[i].PTR = UNASSIGNED;
    DAT[i].USED = 0;
END FOR;

/* RTi initialization */
FOR EACH RTi
    FOR j = 1 TO N_PG,RTi
        FOR k = 1 TO PGL_RTi
            RTi[j,k].PTR = UNASSIGNED;
            RTi[j,k].USED = 0;
        END FOR;
    END FOR;
END FOR EACH;

/* SST initialization */
FOR i = 1 TO N_cluster
    SST[i].ADDR = UNASSIGNED;
    SST[i].USED = 0;
END FOR;

/* step 0.3 (SSTPF initialization) */
FOR i = 1 TO N_cluster
    SSTPF[i].USED = 0;
END FOR;






3.1.3 NORMAL OPERATION MODE 



[0090] In the Normal Operation Mode the algorithm splits the whole space U into N_subspace equal subspaces by means of the DAT. Then the elements of S, which may fall into any one (and even into more than one) of these subspaces, are clustered into N_cluster sets by the cascade of RTi. The result of this clustering process (which may be visualized as a further splitting of the set S) is a cluster identifier (CLID, CLuster IDentifier) of the cluster in which the sequential search is performed; this is done in the SST. In the sequential search, if one of the addresses stored in the selected cluster (i.e. the one at position SSLPOS, Sequential Search List POSition) matches the Incoming Vector INCVECT (in practice, the Comparison Result is monitored), then the Incoming Vector is compressible and its compressed form c is represented by the pair (CLID, SSLPOS); otherwise the Incoming Vector INCVECT is not compressed.

[0091] It is also possible to define the compressed form c as the absolute address of the row identified by the sequential search phase in the SST.
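[0092] Summarizing:

• if, for a given $\mathrm{INCVECT} \in S$, $\exists\, c = (\mathrm{CLID}, \mathrm{SSLPOS}) \mid \mathrm{OUTVECT} = \mathrm{SST}(c) = \mathrm{INCVECT}$, then INCVECT is compressible;

• if, for a given $\mathrm{INCVECT} \in S$, $\nexists\, c = (\mathrm{CLID}, \mathrm{SSLPOS}) \mid \mathrm{OUTVECT} = \mathrm{SST}(c) = \mathrm{INCVECT}$, then INCVECT is not configured for compression;

• all $\mathrm{INCVECT} \in (U - S)$ are not configured for compression.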
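The lookup just summarized can be sketched as follows. The sketch assumes the tables are held as Python dictionaries keyed by (page, row), models UNASSIGNED as a missing key, follows the bit slicing of dat_rowsel and rt_rowsel, and counts SSLPOS from 0. The example table contents are hand-written for illustration rather than produced by the Configuration Mode.

```python
from __future__ import annotations

def bit_slice(s: int, hi: int, lo: int) -> int:
    """Bits hi..lo of s (inclusive): the s(hi, lo) notation of the pseudo-code."""
    return (s >> lo) & ((1 << (hi - lo + 1)) - 1)

def cssa_lookup(s: int, n: int, wl_dat: int, wl_rt: list[int],
                dat: dict[int, int], rts: list[dict[tuple[int, int], int]],
                sst: dict[int, list[int]]) -> tuple[int, int] | None:
    """Return c = (CLID, SSLPOS) if s is configured for compression, else None.

    dat maps a DAT row to a page of RT1; rts[i] maps (page, row) to a page of the
    next RT, or to a cluster identifier (CLID) for the last RT; sst maps a CLID to
    its sequential search list. A missing key plays the role of UNASSIGNED."""
    ptr = dat.get(bit_slice(s, n - 1, n - wl_dat))   # DAT: split U
    if ptr is None:
        return None
    consumed = wl_dat
    for rt, width in zip(rts, wl_rt):                 # RTi cascade: cluster S
        row = bit_slice(s, n - consumed - 1, n - consumed - width)
        consumed += width
        ptr = rt.get((ptr, row))
        if ptr is None:
            return None
    for sslpos, addr in enumerate(sst.get(ptr, [])):  # SST: at most SSLL compares
        if addr == s:                                 # comparison result
            return ptr, sslpos
    return None

dat = {0xAB: 0}                       # DAT row 0xAB -> page 0 of RT1
rts = [{(0, 0xC): 0}, {(0, 0xD): 5}]  # RT1 -> page 0 of RT2 -> cluster 5
sst = {5: [0xABCDE0, 0xABCDEF]}       # cluster 5 holds two addresses
print(cssa_lookup(0xABCDEF, 24, 8, [4, 4], dat, rts, sst))  # (5, 1)
print(cssa_lookup(0xABCD00, 24, 8, [4, 4], dat, rts, sst))  # None
```

The second query reaches cluster 5 through the DAT and RTs but fails the SST comparison, which corresponds to the second case in the summary above.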

[0093] Figure 12 also shows a Translation Table (TT). This block is optional and not part of the structure: it plays no part in the algorithm of the invention and is shown only as a simple means of also performing an address translation, the result of which is an Outcoming TAG.

[0094] The pseudo-code for the algorithm (Normal Operation Mode branch) is: 
MAIN(s, OUTVECT)

/* global declarative part */
TYPE PTR IS pointer to page of DAT, RTi, SST, SSTPF;
TYPE USED IS the number of different paths which pass through a specified row of DAT, RTi, SST, SSTPF;
TYPE ROW IS row location of DAT, RTi, SST, SSTPF;
TYPE ADDR IS address ∈ S;
TYPE CMPADDR : RECORD IS (PTR, ROW);
VAR s : ADDR;
VAR outvect : ADDR;
VAR c : CMPADDR;

BEGIN
    /* local declarative part */
    VAR ptr1, ptr2 : PTR;
    VAR sstpf_used : USED;
    VAR addr : ADDR;

    /* executive part */
    /* DAT */
    ptr1 = DAT[dat_rowsel(s)].PTR;
    IF (ptr1 = UNASSIGNED) THEN
        /* s not configured for compression: exit */
        c := {UNASSIGNED, UNASSIGNED};
        OUTVECT := UNASSIGNED;
        EXIT;
    END IF;

    /* RTi */
    FOR i = 1 TO nst
        ptr2 = RTi[ptr1, rt_rowsel(i, s)].PTR;
        IF (ptr2 = UNASSIGNED) THEN
            /* s not configured for compression: exit */
            c := {UNASSIGNED, UNASSIGNED};
            OUTVECT := UNASSIGNED;
            EXIT;
        END IF;
        ptr1 = ptr2;    /* page of the next RT, or cluster after the last RT */
    END FOR;

    /* SST */
    sstpf_used = SSTPF[ptr2].USED;
    IF (sstpf_used = 0) THEN
        /* s not configured for compression: exit */
        c := {UNASSIGNED, UNASSIGNED};
        OUTVECT := UNASSIGNED;
        EXIT;
    END IF;
    FOR i = 1 TO sstpf_used
        addr = SST[ptr2, i].ADDR;
        IF (addr = s) THEN
            /* addr found: exit */
            c := {ptr2, i};
            OUTVECT := addr;
            EXIT;
        END IF;
    END FOR;

    /* no address of the cluster matches s: not configured for compression */
    c := {UNASSIGNED, UNASSIGNED};
    OUTVECT := UNASSIGNED;
END;
3.1.4 CONFIGURATION MODE



[0095] Given the set S of addresses to be compressed and the set C of compressed addresses, the setup of CSSA consists in assigning all the parameters DAT[i].PTR, DAT[i].USED, RTi[j,k].PTR and RTi[j,k].USED so as to configure all the elements of the set S for compression. CSSA supports both an absolute and an incremental Configuration Mode:

• in an absolute mode, all elements of S are set for compression in a single Configuration Mode session, that is, all parameters of DAT, RTi, SST and SSTPF are written from scratch;

• in an incremental mode, new elements are configured or unconfigured for compression in an incremental way, that is, without rewriting all parameters of DAT, RTi, SST and SSTPF from scratch.

[0096] The pseudo-code for the Configuration Mode is: 






MAIN(S)
BEGIN

    /* declarative part */
    TYPE PTR IS pointer to page of DAT, RTi, SST, SSTPF;
    TYPE ROW IS row location of DAT, RTi, SST, SSTPF;
    TYPE ADDR IS address ∈ S;
    VAR s : ADDR;
    VAR j, p : PTR;
    VAR k : ROW;

    /* executive part */
    FOR EACH s ∈ S

        /* step 1 (DAT configuration) */
        j = rt_pagesel(1, s);
        k = rt_rowsel(1, s);
        DAT[dat_rowsel(s)].PTR = j;
        RT1[j, k].USED ++;
        DAT[dat_rowsel(s)].USED ++;

        /* step 2.i (RTi configuration: from RT1 to RTn) */
        FOR i = 1 TO (n-1)
            p = rt_pagesel(i+1, s);    /* evaluated once: USED changes below */
            RTi[j, k].PTR = p;
            j = p;
            k = rt_rowsel(i+1, s);
            RT(i+1)[j, k].USED ++;
        END FOR;

        /* step 3 (SST & SSTPF configuration) */
        p = sst_pagesel(s);
        RTn[j, k].PTR = p;
        SST[p, sst_rowsel(s)].ADDR = s;
        SSTPF[p].USED ++;

    END FOR EACH s;

END;



where dat_rowsel(s), rt_pagesel(i,s), rt_rowsel(i,s), sst_pagesel(s) and sst_rowsel(s) are functions that, starting from the address s selected for compression, calculate the most suitable row or page of a specific table (DAT, RTi, SST, SSTPF) in order to avoid routing congestion. The pseudo-code for these functions is as follows:



function dat_rowsel(s: INCVECT) RETURN row
TYPE ROW IS row location of DAT, RTi, SST, SSTPF;
row : ROW;
BEGIN

/* slice of the WL_DAT msb of s */
row := s(n-1, n-WL_DAT);
END dat_rowsel;

function rt_rowsel(i: i-nth RT identifier, s: INCVECT) RETURN row
TYPE ROW IS row location of DAT, RTi, SST, SSTPF;
row : ROW;
BEGIN

IF (i = 1) THEN

/* slice of WL_RT1 bits of s */

row := s(n-WL_DAT-1, n-WL_DAT-WL_RT1);

ELSE

/* slice of WL_RTi bits of s */

row := s(n-WL_DAT-(Σ_{h=1..i-1} WL_RTh)-1, n-WL_DAT-(Σ_{h=1..i} WL_RTh));

END IF;
END rt_rowsel;

function rt_pagesel(i: i-nth RT identifier, s: INCVECT) RETURN page

TYPE PTR IS pointer to page of DAT, RTi, SST, SSTPF;

TYPE ROW IS row location of DAT, RTi, SST, SSTPF;

page : PTR;

row : ROW;

BEGIN

row = rt_rowsel(i, s);

/* select the page whose USED counter at this row is the lowest */
tmp_used := MAXINT;
FOR u = 1 TO Npg_RTi

IF (tmp_used > RTi[u,row].USED) THEN
tmp_used := RTi[u,row].USED;
page := u;

END IF;
END FOR;

END rt_pagesel;

function sst_rowsel(s: INCVECT) RETURN row
TYPE ROW IS row location of DAT, RTi, SST, SSTPF;
row : ROW;
BEGIN

/* slice of the WL_SST lsb of s */
row := s(WL_SST-1, 0);
END sst_rowsel;







EP 0 978 966 A1 



function sst_pagesel(s: INCVECT) RETURN page
TYPE PTR IS pointer to page of DAT, RTi, SST, SSTPF;
TYPE ROW IS row location of DAT, RTi, SST, SSTPF;
VAR page : PTR;
VAR row : ROW;
BEGIN

row = sst_rowsel(s);

/* select the SST page (cluster) whose USED counter at this row is the lowest */
tmp_used := MAXINT;
FOR u = 1 TO N_cluster

IF (tmp_used > SST[u,row].USED) THEN
tmp_used := SST[u,row].USED;
page := u;

END IF;
END FOR;

END sst_pagesel;
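The row and page selection functions above can be mirrored in Python to check the bit-slice bounds. This is a sketch under our own assumptions: addresses are plain integers, `wl_rt` is a hypothetical list holding WL_RT1, WL_RT2, ..., and the software image of the USED fields of a table is a hypothetical list of per-page lists; pages and rows are 0-based here, unlike the 1-based pseudo-code.

```python
def bit_slice(s: int, hi: int, lo: int) -> int:
    """Bits hi..lo of s (inclusive), i.e. the pseudo-code slice s(hi, lo)."""
    return (s >> lo) & ((1 << (hi - lo + 1)) - 1)


def dat_rowsel(s: int, n: int, wl_dat: int) -> int:
    """DAT row: the WL_DAT most significant bits of the n-bit address s."""
    return bit_slice(s, n - 1, n - wl_dat)


def rt_rowsel(i: int, s: int, n: int, wl_dat: int, wl_rt: list) -> int:
    """Row of RTi (i is 1-based): the WL_RTi bits just below those used by DAT, RT1..RT(i-1)."""
    top = n - wl_dat - sum(wl_rt[:i - 1])   # bits still unused above this slice
    return bit_slice(s, top - 1, top - wl_rt[i - 1])


def least_used_page(used: list, row: int) -> int:
    """Index of the first page whose USED counter at `row` is the smallest."""
    return min(range(len(used)), key=lambda page: used[page][row])
```

With 8-bit addresses, WL_DAT = 2 and three 2-bit RT slices, address 42 (hex) maps to DAT row 1 and to rows 0, 0 and 2 in RT1, RT2 and RT3.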






3.1.4.1 NOTE ON OPERATING MODES 

[0097] The part of the algorithm executed in the Normal Operation Mode is wholly implemented in hardware, while the part executed in the Configuration Mode is wholly implemented in software (the implemented architecture only provides the primitives needed by the configuration software to write the physical tables). The USED field is not actually present in the physical tables; it is only present in a software image of the physical tables used by the configuration software during the configuration phase.

3.1.5 EXAMPLES

Example 1 

[0098] Figure 14 shows an example of operation of the CSSA method of the invention. This example helps in understanding both the Configuration Mode and the Normal Operation Mode. In this example the whole space U is represented by all the addresses of eight bits. We are interested in compressing eight of them, namely addr0, addr1, addr2, addr3, addr4, addr5, addr6, addr7, which form the set S.
[0099] Summarizing: 

$U = \{a_0, \ldots, a_{255}\}$;

$S = \{\text{addr0}, \text{addr1}, \ldots, \text{addr7}\}$;

$C = \{a_0, \ldots, a_7\}$.

[0100] The number of clusters is chosen as N_cluster = 2 and the length of each cluster is set to SSLL = 4. The parameter N_subspace = 4 has been chosen, so that the whole space U is split into four equal subspaces: Sub0, Sub1, Sub2, Sub3. The addresses configured for compression are encoded in both hexadecimal and binary code (e.g. addr0 = 42 (hex) = 01-00-00-10 (bin)).

[0101] The binary representation has its digits grouped into pairs (separated by '-'): the 1st pair is used to split U into four subspaces (Sub0, Sub1, Sub2, Sub3); the 2nd, 3rd and 4th pairs are used to select the position within each page where each addr is routed in the routing tables RT1, RT2 and RT3, respectively (this degree of freedom is used by the configuration part of the algorithm to avoid routing congestion of a table RTi). The clustering of the elements of the set S into N_cluster sets is performed by choosing the pointers RTi[j,k].PTR appropriately.
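The pair grouping of [0101] can be reproduced for addr0 = 42 (hex) with a few lines of Python; the helper name is ours. The first pair selects the subspace (Sub1 here) and the remaining three pairs give the rows used in RT1, RT2 and RT3.

```python
def split_pairs(addr: int, n_bits: int = 8) -> list:
    """Split an n_bits-wide address into 2-bit groups, most significant group first."""
    return [(addr >> shift) & 0b11 for shift in range(n_bits - 2, -1, -2)]


addr0 = 0x42
print(format(addr0, "08b"))   # 01000010
pairs = split_pairs(addr0)
print(pairs)                  # [1, 0, 0, 2]
subspace, rt_rows = pairs[0], pairs[1:]   # Sub1; rows 0, 0, 2 in RT1, RT2, RT3
```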






Example 2 

[0102] Figure 15 shows another example of algorithm operation. 
3.1.6 ALGORITHM PROOF

[0103] The algorithm proof is performed by dimensioning DAT, RTi, SST and SSTPF for a given general problem and by proving constructively that, with the calculated dimensioning, for every possible couple (INCVECT, OUTVECT) there exists a set of parameters W that allows the desired transformation S → C.
[0104] The proof is carried out following this scheme:

1. for each stage, a number of links (rows × pages) sufficient to allocate all the addresses to be compressed is allocated (sufficient condition);

2. between each pair of adjacent tables, a convenient partitioning (number of pages, number of rows), suitable to avoid routing congestion, is defined (sufficient condition).



[0105] Steps (1) and (2) prove that between each pair of tables all the addresses can be allocated and routed without congestion. This proves the algorithm, since steps (1) and (2) are iterated on all tables, starting from the SST and propagating backward to the DAT.

3.1.6.1 DIMENSIONING OF SST AND SSTPF 

[0106] The set S of addresses to be compressed has $2^{N_{cpr}}$ elements. As a consequence:

[0107] The number of clusters N_cluster (which must be a power of 2) is chosen depending on the required maximum duration of the sequential search phase, which depends on the length of each sequential search list (SSLL):



$$SSLL = \frac{2^{N_{cpr}}}{N_{cluster}} \qquad (2)$$



[0108] So, the SST is a table of N_cluster pages, where each page has SSLL rows. The SSTPF is a table with N_cluster rows.
3.1.6.2 DIMENSIONING OF RTi - PREAMBLE

[0109] The dimensioning of the routing tables RTi starts from the one nearest to the SST (RTnst) and propagates backward to RT1. To characterize each RTi, three dimensions are needed:

• PGL_RTi: page length of the i-th RT (PGL_RTi is a power of 2, $n_{PGL_{RT_i}} = \log_2 PGL_{RT_i}$);

• Npg_RTi: number of pages of RTi;

• WL_RTi: word width of RTi (expressed in bits).

[0110] It is crucial to choose the values of Npg_RTi and PGL_RTi sufficiently large to avoid routing congestion. To do this, a set of equations must be written, each one taking into account a different kind of blocking condition.




3.1.6.3 DIMENSIONING OF RTnst

[0111] Starting from the table RTnst (the one connected to the SST), in order to address each page of the SST the following relations must be verified:

$$WL_{RT_{nst}} \geq \log_2 N_{cluster} \qquad (3)$$

$$N_{pg_{RT_{nst}}} \cdot 2^{n_{PGL_{RT_{nst}}}} \geq 2^{N_{cpr}} \qquad (4)$$

[0112] Equation (4) defines an increment of the number of compressed addresses sufficient to ensure that RTnst contains enough links to allocate all the addresses belonging to S (case of fully shuffled addresses). In order to set a proper value of Npg_RTnst, a strategy to avoid routing congestion must be followed.

3.1.6.3.1 No congestion condition 



[0113] A key factor to be kept under control in any routing process is the congestion of the Routing Tables; each incoming address is in fact «routed» to the correct cluster by passing through the Routing Tables. Congestion occurs in those rows where the USED fields have a relatively high value; this happens when many different addresses exhibit the same slice of bits of s in the same RTi (case of fully collapsed addresses). In this case, these collapsing addresses must be split over different pages, and this sets the number of pages of each RTi. The following equation expresses this circumstance:

$$N_{pg_{RT_{nst}}} \geq \frac{\min\left(2^{\,n-n_{PGL_{RT_{nst}}}},\ 2^{N_{cpr}}\right)}{2^{N_{cpr}}}\, N_{cluster} \qquad (5)$$

Eq. (5) is based on a property of binary numbers: the number of n-bit vectors which exhibit the same pattern of $n_{PGL_{RT_{nst}}}$ adjacent bits (in any arbitrary but fixed position in the vector) is $2^{\,n-n_{PGL_{RT_{nst}}}}$. The number of addresses to be considered is upper bounded by $2^{N_{cpr}}$. Thus, the minimum between $2^{\,n-n_{PGL_{RT_{nst}}}}$ and $2^{N_{cpr}}$ must be chosen.

[0114] In eq. (5), the expression $\min\left(2^{\,n-n_{PGL_{RT_{nst}}}},\ 2^{N_{cpr}}\right)$ takes the value $2^{N_{cpr}}$ in any practical case; as a consequence, eq. (5) becomes:

$$N_{pg_{RT_{nst}}} \geq N_{cluster} \qquad (5')$$

Eq. (5') defines a sufficient condition to avoid congestion at the nst-th stage (RTnst).
[0115] Now, substituting eq. (5') in equation (4), we obtain:

$$PGL_{RT_{nst}} \geq \frac{2^{N_{cpr}}}{N_{cluster}} \qquad (6)$$

[0116] The last relationship that must be verified for PGL_RTnst is determined by a reachability condition.
3.1.6.3.2 Reachability condition

[0117] This condition imposes that all pages of the SST can be reached from any fully routed page of RTnst (a page in which the USED field is different from zero for at least one row):

$$n_{PGL_{RT_{nst}}} \geq \log_2(N_{cluster}) \qquad (7)$$

[0118] Eqs. (5'), (6) and (7) give the required dimension of RTnst.
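As a worked reading of eqs. (2), (4), (5') and (6), take the smallest page count allowed by (5'); the page-length bound then coincides with the cluster length SSLL. With the Example 1 figures (N_cpr = 3, N_cluster = 2) this gives PGL_RTnst ≥ 4, matching the 2-bit row slices used there, while the reachability condition (7) alone would only require n_PGL_RTnst ≥ 1:

$$
N_{pg_{RT_{nst}}} = N_{cluster}
\;\Rightarrow\;
PGL_{RT_{nst}} = 2^{n_{PGL_{RT_{nst}}}} \geq \frac{2^{N_{cpr}}}{N_{cluster}} = SSLL,
\qquad
N_{cpr}=3,\ N_{cluster}=2:\quad PGL_{RT_{nst}} \geq \frac{2^{3}}{2} = 4,\quad n_{PGL_{RT_{nst}}} \geq \log_2 2 = 1.
$$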

[0119] Besides dimensioning, another parameter needs to be defined in order to perform the algorithm presented in the previous paragraph: the maximum reuse for each row of RTnst, that is, the maximum acceptable value of the RTnst[j,k].USED field, must be determined; this value will be named $n_{reuse_{RT_{nst}}}$.

[0120] This value is calculated on the basis of an allocability condition.
3.1.6.3.3 Allocability condition

[0121] If in any row of any page of RTnst the USED field exceeds the parameter SSLL, the addresses belonging to that row will not be allocable by any page of the SST (not even by an empty page, because an empty page can allocate a maximum of SSLL entries). To prevent this circumstance, the value of $n_{reuse_{RT_{nst}}}$ must be bounded by SSLL:

$$n_{reuse_{RT_{nst}}} \leq SSLL \qquad (8)$$



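[0122] When this equation is verified, the dimensioning of RTnst is completed. Summarizing:

$$N_{pg_{RT_{nst}}} \geq N_{cluster}, \qquad PGL_{RT_{nst}} \geq \frac{2^{N_{cpr}}}{N_{cluster}}, \qquad n_{PGL_{RT_{nst}}} \geq \log_2(N_{cluster}), \qquad WL_{RT_{nst}} \geq \log_2(N_{cluster})$$

[0123] To save memory and to simplify the hardware implementation:

$$N_{pg_{RT_{nst}}} = N_{cluster}$$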



3.1.6.4 DIMENSIONING OF RTi

[0124] For all the RTi other than RTnst, the above remains valid by substituting N_cluster with $N_{pg_{RT_{i+1}}}$. Thus:

$$WL_{RT_i} = \log_2(N_{pg_{RT_{i+1}}}) \qquad (9)$$

and the following equation sets an increment of the number of compressed addresses

$$N_{pg_{RT_i}} \cdot 2^{n_{PGL_{RT_i}}} \geq 2^{N_{cpr}} \qquad (10)$$

sufficient to ensure that in the i-th stage (RTi) enough links to allocate all the addresses belonging to S (case of fully shuffled addresses) are available.

[0125] As for RTnst, the maximum reuse for each row of RTi, that is, the maximum acceptable value of the RTi[j,k].USED field, must be determined in order to perform the algorithm presented in the previous paragraph: this value will be named $n_{reuse_{RT_i}}$ and is calculated on the basis of the allocability condition.

3.1.6.4.1 Allocability condition 

[0126] If, in any row of any page of any RTi, the USED field exceeds a certain function of the next table (RTi+1), the addresses belonging to that row will not be allocable by any row of any page of RTi+1 (not even by an empty page, because an empty page of RTi+1 can allocate a maximum of $PGL_{RT_{i+1}} \cdot n_{reuse_{RT_{i+1}}}$ entries). If this bound is exceeded, $n_{reuse_{RT_{i+1}}}$ will be exceeded somewhere in RTi+1, and so on until RTnst and eventually the SST are reached, where the overallocation error (RTnst[j,k].USED > SSLL) will be evidenced. The bound $PGL_{RT_{i+1}} \cdot n_{reuse_{RT_{i+1}}}$ is valid when the addresses routed by a row of RTi are symmetrically split among the $PGL_{RT_{i+1}}$ rows of a page of RTi+1: this is an optimistic circumstance. In the worst case all these addresses fall into the same row, and consequently the bound will be $n_{reuse_{RT_{i+1}}}$.

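[0127] Summarizing:

$$n_{reuse_{RT_i}} \leq \begin{cases} PGL_{RT_{i+1}} \cdot n_{reuse_{RT_{i+1}}} & \text{all addresses are symmetrically split among the } PGL_{RT_{i+1}} \text{ rows of a page of } RT_{i+1} \\ n_{reuse_{RT_{i+1}}} & \text{all addresses fall into the same row of a page of } RT_{i+1} \end{cases} \qquad (12)$$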

[0128] To set the less strict condition as a bound, a pair of equations must be verified simultaneously on both RTi and RTi+1; so the following system must be verified for each pair of adjacent RTi:

(13) J ~ * "reiixr^,., ' V • • ^ 



[0129] Taking into account that $n_{reuse_{RT_{i+1}}}$ cannot be exceeded, it can be regarded as the SSLL of the SST in the case of fully collapsed addresses, so the no-congestion condition remains valid by substituting N_cluster with $N_{pg_{RT_{i+1}}}$:

$$N_{pg_{RT_i}} \geq \frac{\min\left(2^{\,n-n_{PGL_{RT_i}}},\ 2^{N_{cpr}}\right)}{2^{N_{cpr}}}\, N_{pg_{RT_{i+1}}} \qquad (14)$$

where $N_{pg_{RT_{i+1}}}$ is known. Eq. (14) can be simplified as previously done with eq. (5), and this leads to

$$N_{pg_{RT_i}} \geq N_{pg_{RT_{i+1}}} \qquad (14')$$

Eq. (14') represents a sufficient condition to avoid congestion at the i-th stage.

[0130] Now, by substituting eq. (14') in equation (10), we obtain:

$$PGL_{RT_i} \geq \frac{2^{N_{cpr}}}{N_{pg_{RT_{i+1}}}} \qquad (15)$$

[0131] To reduce the memory requirement, the same number of pages can be allocated for each RTi: $N_{pg_{RT_i}} = N_{pg_{RT_{nst}}}$.

[0132] The last condition that needs to be verified for PGL_RTi is still set by the reachability condition.

3.1.6.4.2 Reachability condition

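[0133] This condition imposes that all the pages of RTi+1 can be reached from any fully routed page of RTi (a page in which the USED field is different from zero for at least one row):

$$n_{PGL_{RT_i}} \geq \log_2(N_{pg_{RT_{i+1}}}) \qquad (16)$$

[0134] Eqs. (14'), (15) and (16) give the dimension of RTi, ∀ i < nst. Summarizing:

$$WL_{RT_i} = \log_2(N_{pg_{RT_{i+1}}}), \qquad N_{pg_{RT_i}} \geq N_{pg_{RT_{i+1}}}, \qquad PGL_{RT_i} \geq \frac{2^{N_{cpr}}}{N_{pg_{RT_{i+1}}}}, \qquad n_{PGL_{RT_i}} \geq \log_2(N_{pg_{RT_{i+1}}})$$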
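The dimensioning rules for RTnst and RTi can be collected into a small Python helper. It is a sketch under stated assumptions: it applies the memory-saving choice Npg_RTi = N_cluster for every stage ([0123], [0131]), takes (3) and (9) with equality, and uses (15) and (16) in the same form as (6) and (7) with N_cluster replaced by Npg_RT(i+1), as [0124] indicates. The helper names and returned keys are hypothetical.

```python
def log2_exact(x: int) -> int:
    """log2 of a positive power of 2 (raises ValueError otherwise)."""
    if x <= 0 or x & (x - 1):
        raise ValueError(f"{x} is not a power of 2")
    return x.bit_length() - 1


def dimension_rt(n_cpr: int, n_cluster: int) -> dict:
    """Minimum-memory RT dimensioning with Npg_RTi = N_cluster for every stage."""
    ssll = (2 ** n_cpr) // n_cluster            # eq. (2)
    n_ssll = log2_exact(ssll)                   # SSLL must itself be a power of 2
    npg = n_cluster                             # eqs. (5'), (14') and [0131]
    # links bound (6)/(15): PGL >= 2**Ncpr / Npg = SSLL
    # reachability (7)/(16): n_PGL >= log2(Npg)
    n_pgl = max(n_ssll, log2_exact(npg))
    wl = log2_exact(npg)                        # eqs. (3)/(9), taken with equality
    return {"SSLL": ssll, "Npg": npg, "PGL": 2 ** n_pgl, "n_PGL": n_pgl, "WL": wl}
```

For Example 1 (N_cpr = 3, N_cluster = 2) the helper returns SSLL = 4, two pages per RT, PGL = 4 and WL = 1, which agrees with the 2-bit row slices used in that example.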



[0135] To save memory and to simplify the hardware implementation: 

3.1.6.5 DIMENSIONING OF DAT



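[0136] The slices of bits which address the rows of the DAT and of each RTi are related by the following equation:

$$n = \log_2(N_{subspace}) + \sum_{i=1}^{nst} \log_2(PGL_{RT_i}) = n_{subspace} + \sum_{i=1}^{nst} n_{PGL_{RT_i}} \qquad (17)$$

so

$$N_{subspace} = 2^{\,n - \sum_{i=1}^{nst} n_{PGL_{RT_i}}} \qquad (18)$$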

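[0137] The parameter WL_DAT is set by eq. (19):

$$WL_{DAT} = \log_2(N_{pg_{RT_1}}) \qquad (19)$$

[0138] Eqs. (18) and (19) give the dimension of the DAT:

$$N_{subspace} = 2^{\,n - \sum_{i=1}^{nst} n_{PGL_{RT_i}}}, \qquad WL_{DAT} = \log_2(N_{pg_{RT_1}})$$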

[0139] Assuming $N_{pg_{RT_1}} = N_{cluster}$, as previously done with the RTi:

$$N_{subspace} = 2^{\,n - \sum_{i=1}^{nst} n_{PGL_{RT_i}}}, \qquad WL_{DAT} = \log_2(N_{cluster})$$
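Eqs. (17) to (19) can be checked numerically. In this sketch (function name hypothetical) the RT page lengths are given as a list of powers of 2. With the Example 1 figures (n = 8, three RTs with PGL = 4) and assuming Npg_RT1 = N_cluster = 2 as in [0139], it yields N_subspace = 4, matching the four subspaces Sub0 to Sub3, and WL_DAT = 1.

```python
def dimension_dat(n: int, pgl: list, npg_rt1: int) -> tuple:
    """Return (N_subspace, WL_DAT) for n-bit addresses and RT page lengths pgl[0..nst-1]."""
    n_pgl = [p.bit_length() - 1 for p in pgl]   # n_PGL_RTi = log2(PGL_RTi), PGL a power of 2
    n_subspace = n - sum(n_pgl)                 # eq. (17) solved for n_subspace
    if n_subspace < 0:
        raise ValueError("the RT page lengths use more than n address bits")
    wl_dat = npg_rt1.bit_length() - 1           # eq. (19), Npg_RT1 a power of 2
    return 2 ** n_subspace, wl_dat              # eq. (18)
```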

3.1.7 ABOUT THE ALGORITHM

[0140] As already assumed, the proof that routing congestion will be prevented implies that the allocation of $n_{reuse_{RT_i}}$ links is not exceeded. This is ensured by monitoring all the fields RTi[j,k].USED during a Configuration Mode phase while allocating links on the emptiest pages. This allocation strategy may be referred to as «maximum spread», since it spreads the addresses over the largest possible number of pages.
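The «maximum spread» strategy can be sketched as follows: for each address, the configuration software picks, among all pages, the one whose USED counter at the address's row is lowest, and refuses the allocation once that counter has reached $n_{reuse}$. The function name and the list-of-lists image of the USED fields are illustrative assumptions.

```python
def maximum_spread(rows: list, n_pages: int, n_rows: int, n_reuse: int):
    """Allocate one link per address; rows[a] is the row selected by address a."""
    used = [[0] * n_rows for _ in range(n_pages)]   # software image of the USED fields
    placement = []
    for row in rows:
        # emptiest page for this row (the first one on ties)
        page = min(range(n_pages), key=lambda p: used[p][row])
        if used[page][row] >= n_reuse:
            raise RuntimeError(f"row {row}: n_reuse = {n_reuse} would be exceeded")
        used[page][row] += 1
        placement.append((page, row))
    return placement, used
```

Four fully collapsed addresses (all selecting row 0) on two pages with $n_{reuse} = 2$ are placed alternately on pages 0 and 1; a fifth one is rejected, which is the congestion case the dimensioning equations are meant to exclude.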

3.2 EXTENSIONS OF THE CSSA TECHNIQUE

[0141] The performance of a CSSA system can be further improved by modifying the algorithm. These alternative embodiments of the basic CSSA technique of the invention will be referred to as EXTENDED CSSA, followed by the notation #1, #2, #3, #4 to identify the individual alternative embodiments. Two kinds of improvement can be obtained:

1) further decreasing the memory size (Msize);

2) further decreasing the number of clock cycles needed to carry out the algorithm (Nclk).
3.2.1 EXTENDED CSSA#1 

[0142] The basic CSSA algorithm can be further improved by modifying the sequential search phase. The extension is named EXTENDED CSSA#1. Two kinds of improvement can be obtained: further decreasing the memory size (Msize) or further decreasing the number of clock cycles needed to carry out the algorithm (Nclk).

[0143] These further improved embodiments generally imply replacing the Sequential Search step with an Extended Sequential Search step. In architectural terms, this means replacing the SST (Sequential Search Table) with an ESST (Extended Sequential Search Table), as in Figure 16.

[0144] An architecture of the ESST block according to a first embodiment (EXTENDED CSSA #1) is shown in Figure 17.

[0145] The ESST block is built with a bank of Nssti tables SSTi and Nssti independent Address Generators, each Address Generator generating an address for the corresponding SSTi. A set of Nssti comparators, which compare the result of the search in each SSTi with the Incoming Vector (INCVECT), completes the architecture. As soon as the CLuster IDentifier (CLID) is provided, the search phase starts in parallel on all the SSTi. As soon as one SSTi finds the Compressed Address, the search stops, and the Compressed Address is sent out and validated through the Outcoming Vector Validation.
[0146] To better understand the magnitude of the improvement it is useful to compare a basic CSSA system with an 
EXTENDED CSSA#1 system for solving the same compression problem. 

a) The SST in the CSSA is defined by these parameters:

• Ncluster_sst: number of clusters (pages) of the SST;

• SSLL_sst: number of rows of each page of the SST.

Regarding the memory requirements, the CSSA is characterized by these parameters:

• Msize_dat_cssa: amount of memory needed by the DAT in CSSA;

• Msize_rt_cssa: amount of memory needed by all the RTi in CSSA;

• Msize_sst_cssa: amount of memory needed by the SST in CSSA;

• Msize_cssa = Msize_dat_cssa + Msize_rt_cssa + Msize_sst_cssa.

Regarding the speed requirements, the CSSA is characterized by these parameters:

• Nclk_dat_cssa: number of clock cycles needed to perform the CSSA algorithm through the DAT in CSSA;

• Nclk_rt_cssa: number of clock cycles needed to perform the CSSA algorithm through all the RTi in CSSA;

• Nclk_sst_cssa: number of clock cycles needed to perform the CSSA algorithm through the SST in CSSA;

• Nclk_cssa = Nclk_dat_cssa + Nclk_rt_cssa + Nclk_sst_cssa (total number of clock cycles to perform the CSSA algorithm).
b) The ESST in the EXTENDED CSSA#1 is defined by these parameters:

• Nssti: number of SSTi instantiated in the ESST;

• Ncluster_ssti: number of clusters (pages) of each SSTi;

• SSLL_ssti: number of rows of each page of each SSTi.

Regarding the memory requirements, the EXTENDED CSSA#1 is characterized by these parameters:

• Msize_dat_ecssa: amount of memory needed by the DAT in EXTENDED CSSA#1;

• Msize_rt_ecssa: amount of memory needed by all the RTi in EXTENDED CSSA#1;

• Msize_esst_ecssa: amount of memory needed by all the SSTi in EXTENDED CSSA#1.

Regarding the speed requirements, the EXTENDED CSSA#1 is characterized by these parameters:

• Nclk_dat_ecssa: number of clock cycles needed to perform the CSSA algorithm through the DAT in EXTENDED CSSA#1;

• Nclk_rt_ecssa: number of clock cycles needed to perform the CSSA algorithm through all the RTi in EXTENDED CSSA#1;

• Nclk_esst_ecssa: number of clock cycles needed to perform the CSSA algorithm through all the SSTi in EXTENDED CSSA#1;

• Nclk_ecssa = Nclk_dat_ecssa + Nclk_rt_ecssa + Nclk_esst_ecssa (total number of clock cycles to perform the EXTENDED CSSA#1 algorithm).



[0147] If the goal is Msize reduction, the parameters are set to obtain the maximum saving in memory requirements, and the relationship between a basic CSSA system and a system according to the embodiment EXTENDED CSSA #1 is:

• Ncluster_ssti = Ncluster_sst for each i;

• SSLL_ssti = SSLL_sst;

each cluster in EXTENDED CSSA#1 is multiplied by a factor Nssti, but since Nssti SSTi tables are used in parallel, the total Sequential Search phase is still the same as in the normal CSSA.

[0148] As a consequence of this parameter setting, the time performances are:

• Nclk_dat_ecssa = Nclk_dat_cssa;

• Nclk_rt_ecssa = Nclk_rt_cssa;

• Nclk_esst_ecssa = Nclk_sst_cssa.

[0149] Since the total number of clock cycles is dominated by Nclk_esst_ecssa in EXTENDED CSSA#1 and by Nclk_sst_cssa in CSSA, we can state that

• Nclk_ecssa ≈ Nclk_cssa;

this is only a rough estimate: the effective time performance is even better, due to the possible reduction of the number of RTi stages resulting from the increased dimension of the clusters in the EXTENDED CSSA#1 case.
[0150] The memory requirements are:

• Msize_esst_ecssa = Msize_sst_cssa * Nssti;

this is the only memory increase because, as will be proved in the next paragraph devoted to the performance comparison,

• Msize_dat_ecssa < Msize_dat_cssa;

• Msize_rt_ecssa < Msize_rt_cssa;

as a consequence,

• Msize_ecssa < Msize_cssa.

[0151] If the goal is Nclk reduction, the parameters are set to obtain the maximum gain in speed while keeping the total amount of memory constant; the relationship between a basic CSSA system and a system according to the embodiment EXTENDED CSSA#1 is:

• Ncluster_ssti = Ncluster_sst for each i;

• SSLL_ssti = SSLL_sst/Nssti;

Nssti must be chosen so that SSLL_ssti ≥ 1;

each cluster in EXTENDED CSSA#1 still has the same size as in CSSA, but since Nssti SSTi tables are used in parallel, the total Sequential Search phase is reduced by a factor Nssti.

[0152] As a consequence of this parameter setting, the memory requirements are:

• Msize_dat_ecssa = Msize_dat_cssa;

• Msize_rt_ecssa = Msize_rt_cssa;

• Msize_esst_ecssa = Msize_sst_cssa;

thus,

• Msize_ecssa = Msize_cssa.
[0153] Regarding the time performances:

• Nclk_dat_ecssa = Nclk_dat_cssa;

• Nclk_rt_ecssa = Nclk_rt_cssa;

• Nclk_esst_ecssa ≈ Nclk_sst_cssa/Nssti.

[0154] Since the total number of clock cycles is dominated by Nclk_esst_ecssa in EXTENDED CSSA#1 and by Nclk_sst_cssa in CSSA, we can state that

• Nclk_ecssa ≈ Nclk_cssa/Nssti.
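The parallel search of [0145], in the Nclk-reduction setting of [0151], can be sketched in Python. Our assumptions: each SSTi is a list of pages, each page a list of stored identifiers; all Nssti tables are read in lock-step, one row per clock cycle; and the returned compressed address is the hypothetical triple (table index, cluster, row).

```python
def esst_search(banks: list, cluster_id: int, incvect: int):
    """Lock-step search of page `cluster_id` in every SSTi.

    banks[b][cluster_id] is the list of identifiers stored in that page of table b.
    Returns ((b, cluster_id, row), cycles) on a hit and (None, cycles) on a miss.
    """
    depth = max(len(bank[cluster_id]) for bank in banks)
    for cycle in range(depth):              # one row of every SSTi per clock cycle
        for b, bank in enumerate(banks):    # the Nssti comparators work in parallel
            page = bank[cluster_id]
            if cycle < len(page) and page[cycle] == incvect:
                return (b, cluster_id, cycle), cycle + 1
    return None, depth
```

With eight identifiers spread over Nssti = 4 tables of SSLL_ssti = 2 rows, a lookup takes at most 2 cycles instead of the 8 needed to scan a single page of SSLL_sst = 8 rows, which is the Nssti-fold reduction stated in [0153].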

3.2.2 EXTENDED CSSA #2 

[0155] Another embodiment of the basic CSSA can be realized with yet another ESST implementation.

[0156] This embodiment is shown in Figure 18.

[0157] Instead of a plurality of SSTs, each as wide as the incoming vector, as in EXTENDED CSSA#1, it is possible to use a single wide memory whose width is the incoming vector width multiplied by Nsst. In this case a single address generator is needed, but Nsst comparators are still required.

[0158] In any case, the performances obtained with EXTENDED CSSA#1 and EXTENDED CSSA#2 are equivalent.
3.2.3 EXTENDED CSSA #3

[0159] Another possible embodiment of the CSSA, EXTENDED CSSA#3, is based on splitting the algorithm into two different steps, named Cluster Detection and Sequential Search respectively. Each step can last up to one cell (packet) time, thanks to the two-FIFO pipeline implemented as depicted in Figure 19.

[0160] In the first step (Cluster Detection), the DAT and RTi analysis is performed and a cluster identifier (CLID) is detected.

[0161] In the second step, the Sequential Search is performed to find the compressed identifier.

[0162] According to this embodiment, it is possible to increase the cluster size, with strong benefits in terms of memory size reduction. The price to pay is a latency of two cells (packets), compared to the "standard" CSSA latency of one cell.

[0163] In any case, the Msize reduction is limited by the size of the SST, which must be at least equal to the minimum theoretical size (CAM).

[0164] This embodiment can be easily coupled with either the EXTENDED CSSA #1 or the EXTENDED CSSA #2 
architecture to increase the cluster size again. 

3.2.4 EXTENDED CSSA #4 

[0165] Yet another, particularly efficient, embodiment of the basic CSSA technique of this invention may be suitable to compress different classes or sets of addresses S_1, ..., S_Nclass, the sets of addresses to be compressed (whose union is named S), belonging to the set U, the whole addressing space.

[0166] For each address belonging to the generic set S_j, the algorithm must identify one and only one address belonging to C_j, the set of compressed addresses which corresponds to the set S_j (i.e. perform a transformation S_j → C_j); this must be verified ∀ j ∈ 1, ..., Nclass.

[0167] An example of a graphical representation of this problem is given in Figure 20. According to this embodiment (EXTENDED CSSA #4), a combination of the three fundamental steps of the basic algorithm is used: splitting of S into subspaces via the direct addressing table (DAT), clustering via the routing tables (RTij) and sequential search via at least Nclass sequential search tables (SSTj).

[0168] However, the system of this embodiment combines in a tree, with an arbitrary number of levels, different Clustering phases (RTi,branch j-nth) working in parallel and originating from a common branch, said common branch being the end point of a Splitting phase (DAT) plus a Clustering phase (RT1_root, ..., RTi_root, ..., RTn_root) which is the common ancestor of all the branches, each leaf of the tree being constituted by a Sequential Search phase (SST branch j-nth). This leads to a structure that may be described as an "RT tree".

[0169] Figure 21 shows the general structure of an EXTENDED CSSA#4 system.

[0170] The system behaves in different ways depending on the incoming vector domains, and the various SST branch j-nth and RTi,j-nth_branch are tuned in the most efficient way. This architecture allows memory savings by sharing the DAT and some RTi,j-nth_branch before any branching off.

[0171] An example of implementation is depicted in Figure 22.

[0172] This embodiment is particularly suitable for IP and multicast applications.

3.3 PERFORMANCE COMPARISON

[0173] In order to perform a correct benchmarking between various address compression techniques, it is useful to define the main parameters of a generic «address compression function».

[0174] Figure 23 shows the parameters used to evaluate the architectures.

2^N is the number of possible incoming identifiers (N is the length, in bits)

2^Ncpr is the number of possible compressed identifiers (Ncpr is the length, in bits)

AVclk is the minimum packet/cell interarrival time (in clock cycles)

Nclk is the number of clock cycles needed to perform address compression

Msize is the total memory size needed to perform address compression

Nmem is the number of physical memories needed
[0175] The parameter N typically dominates the memory size requirement, and the parameter Ncpr the complexity of the compression process.

[0176] Any architecture is constrained to consume no more clock cycles (Nclk) than AVclk.

[0177] The RAM requirement Msize provides the indicator of the efficiency of the processing structure.

[0178] Two scenarios have been investigated: «ATM» and «IP». Both scenarios have been tested assuming a 622 Mbit/s (STM-4) full throughput. Obviously, other speed assumptions (e.g. 155 Mbit/s or 1.3 Gbit/s) would lead to completely different comparison results. A 622 Mbit/s throughput has been chosen in order to be in line with present trends in ATM switch and IP router technology.

[0179] The «ATM» scenario implies N=24 bits. Assuming 53 bytes/cell, testing these architectures at 622 Mbit/s means AVclk=26.

[0180] The «IP» scenario implies N=32 bits. Assuming 64 bytes for the shortest packet, testing these architectures at 622 Mbit/s means AVclk=32.



Scenario    N (bits)    AVclk (clock cycles)
«ATM»       24          26
«IP»        32          32

Table 1: Input parameters for benchmarking
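The AVclk values of Table 1 can be reproduced by assuming that the incoming stream is handled 16 bits per clock cycle. That datapath width is our assumption, not stated in the text: a 53-byte cell (424 bits) then gives 26 whole cycles and a 64-byte packet (512 bits) gives 32.

```python
def avclk(packet_bytes: int, bits_per_clock: int = 16) -> int:
    """Whole clock cycles per cell/packet when the stream is consumed bits_per_clock bits per cycle.

    bits_per_clock = 16 is an assumed datapath width chosen to match Table 1.
    """
    return (packet_bytes * 8) // bits_per_clock


print(avclk(53), avclk(64))  # 26 32
```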



[0181] Each architecture is tested, for each scenario, for every Ncpr value between 2 and 16: this means examining the performance over a very wide range of possible applications.

[0182] By writing the performance as an equation and the constraint as an inequality, it is possible to state:

Msize = F(N, Ncpr, P1, P2, ...) for Ncpr ∈ (2..16)

Nclk = G(N, Ncpr, P1, P2, ...) ≤ AVclk for Ncpr ∈ (2..16)

[0183] P1, P2, etc. are «technique-dependent free parameters». For example, with a Binary-Tree Algorithm the free parameter is the number of stages Nst; with a Clustered Sequential Search Algorithm the free parameter is the cluster size SSLL.

[0184] In order to arrive at an objective performance evaluation, the analysis has been performed using the «good designer approach»: for each technique, for each N and for each Ncpr, the best value of the free parameter (the value that minimizes Msize) has been identified and applied. Figure 24 shows this concept.

[0185] As far as the inequality Nclk ≤ AVclk is concerned, one could argue that, if the clock that reads the memories M1..MNmem is faster (e.g. double) than the clock of the incoming serial stream (address), the performance of the system can be improved. This is true but, because the same «trick» could be applied with the same benefits to any technique, a «common reference clock» has been defined in order to perform a fair comparison.

[0186] The applied «reference clock» is the clock related to the incoming address.

3.3.1 CAM PERFORMANCE 

[0187] The analysis of CAM is straightforward. The number of required bits is

Msize = N*2^Ncpr (bits)

[0188] There are no free parameters, and Nclk will in any case be less than AVclk. Nmem is obviously 1.
For «ATM» scenario (N=24):

Ncpr    Msize (bits)
2       96
3       192
4       384
5       768
6       1536
7       3072
8       6144
9       12288
10      24576
11      49152
12      98304
13      196608
14      393216
15      786432
16      1572864

Table 2: Msize for «ATM» scenario performing CAM

For «IP» scenario (N=32):

Ncpr    Msize (bits)
2       128
3       256
4       512
5       1024
6       2048
7       4096
8       8192
9       16384
10      32768
11      65536
12      131072
13      262144
14      524288
15      1048576
16      2097152

Table 3: Msize for «IP» scenario performing CAM



[0189] With Ncpr = 16 (2^16 = 64K compressed identifiers), around 1.5 Mbit («ATM») and 2 Mbit («IP») of CAM are needed.

[0190] It should be pointed out that a CAM cell is more complex than a conventional RAM cell, and that serious technology problems arise as its size increases.
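The CAM formula of [0187] regenerates Tables 2 and 3 directly; the sketch below (names ours) also confirms the Ncpr = 16 figures quoted in [0189].

```python
def cam_msize(n: int, n_cpr: int) -> int:
    """CAM size in bits: 2**n_cpr entries, each storing an n-bit incoming identifier."""
    return n * 2 ** n_cpr


# Rebuild Tables 2 and 3 for Ncpr = 2..16
atm = {k: cam_msize(24, k) for k in range(2, 17)}
ip = {k: cam_msize(32, k) for k in range(2, 17)}
print(atm[16], ip[16])  # 1572864 2097152
```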

3.3.2 PURE SEQUENTIAL SEARCH ALGORITHM PERFORMANCE 

[0191] An analysis of the efficiency of a sequential search algorithm is rather immediate.

[0192] The Add vector scans the memory M and, if a data word in the memory matches the incoming address value, the compressed identifier is set equal to the memory address Add. The process is summarized in Figure 25.

[0193] The number of bits required is

(1) Msize = N*2^Ncpr (bits)

and the clock cycles needed are

(2) Nclk = 2^Ncpr

[0194] Obviously, this technique can be applied only with small values of Ncpr (2 to 4.5). Nmem is 1. The following tables describe the requirements for the two considered scenarios.
For «ATM» scenario (N=24):

Table 4: Msize for «ATM» scenario performing Pure Sequential search

For «IP» scenario (N=32):

Table 5: Msize for «IP» scenario performing Pure Sequential search
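A direct transcription of the pure sequential search, together with the largest Ncpr whose worst-case scan fits within AVclk, is sketched below; the function names are ours and the memory M is modelled as a Python list.

```python
def pure_sequential_compress(memory: list, incvect: int):
    """Scan M word by word (Add = 0, 1, ...); return (compressed identifier, clock cycles used)."""
    for add, word in enumerate(memory):
        if word == incvect:
            return add, add + 1          # the matching address Add is the compressed identifier
    return None, len(memory)


def max_ncpr_pure(avclk: int) -> int:
    """Largest Ncpr whose worst-case scan, Nclk = 2**Ncpr, still fits in AVclk cycles."""
    ncpr = 0
    while 2 ** (ncpr + 1) <= avclk:
        ncpr += 1
    return ncpr
```

Under the Nclk ≤ AVclk constraint this gives Ncpr = 4 for the «ATM» scenario (AVclk = 26) and Ncpr = 5 for the «IP» scenario (AVclk = 32).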

3.3,3 EXTENDED SEQUErslTIAL SEARCH ALGORITHM PERFORMANCE 

[01 95] In this case a free parameter exist. It is the number of memories where it is possible to perform simultaneously 
a sequential search (Nmem). For each memory, an Add(i) vector scan M(i) and. if a data in a memory (1) matches the 
35 incoming address value, the compressed Identifier is set equal to the memory (i) concatenated to Add. . The process is 
summarized in 

[0196] The number of bits required is

(1) $\mathrm{Msize} = N \cdot 2^{\mathrm{Ncpr}}$ (bits)

and the clock cycles needed are

(2) $\mathrm{Nclk} = 2^{\mathrm{Ncpr}} / \mathrm{Nmem}$

[0197] Unfortunately, the number of memories that can be put in place is limited, and 8 (at most 16) may be regarded as the largest possible value for Nmem. This limits 2^Ncpr to 256 or 512. The following tables show the requirements for the two scenarios.
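A minimal sketch of the extended scheme, assuming Nmem equal-depth memories whose depth is a power of two. The memories are Python lists, and the inner loop, sequential here, stands for reads that happen concurrently in hardware. The compressed identifier is built as the memory index concatenated with Add(i), as in [0195]; the function and variable names are illustrative.

```python
from typing import Optional, Sequence

def extended_sequential_search(memories: Sequence[Sequence[Optional[int]]],
                               incoming: int) -> Optional[int]:
    """Scan Nmem memories in lock step; return (i << width(Add)) | Add on a match.

    One iteration of the outer loop models one clock cycle in which every M(i)
    is read at the same address, so a miss costs 2**Ncpr / Nmem cycles.
    """
    depth = len(memories[0])               # 2**Ncpr / Nmem, assumed a power of two
    add_bits = depth.bit_length() - 1      # width of the Add(i) vector
    for add in range(depth):               # one clock cycle per iteration
        for i, mem in enumerate(memories): # concurrent in hardware
            if mem[add] == incoming:
                return (i << add_bits) | add   # memory index concatenated with Add
    return None

# Hypothetical contents: Ncpr = 3 split over Nmem = 2 memories of depth 4.
M0 = [10, 11, None, 13]
M1 = [20, None, 22, 23]
print(extended_sequential_search([M0, M1], 22))  # 6  (memory 1, Add = 2)
print(extended_sequential_search([M0, M1], 99))  # None
```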
For «ATM» scenario (N=24):    For «IP» scenario (N=32):

Table 6: Msize for «ATM» scenario performing Extended Sequential search
Table 7: Msize for «IP» scenario performing Extended Sequential search

3.3.4 BINARY TREE ALGORITHM PERFORMANCE

[0198] Figure 27 shows the structure for a binary tree search. The N-bit wide incoming address is split into vectors of size W0, W1, W2, ..., W(Nst-1), obviously providing for

(1) $N = \sum_{i=0}^{\mathrm{Nst}-1} W_i$

These addresses are sent to Nst different memory banks (these banks may be organized in a single physical memory array: Nmem=1). The output data of bank i, concatenated with W(i+1), is used to address bank (i+1).
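The cascade just described can be sketched in Python. The assumptions are ours, not the source's. Banks are dictionaries, and a missing key plays the role of a cleared «active» bit. The bank output is placed in the high-order bits, ahead of W(i+1). The output of the last bank is taken as the compressed identifier. The hand-built tables hold a single hypothetical address.

```python
from typing import Optional

def split_address(addr: int, widths: list[int]) -> list[int]:
    """Split an N-bit address into slices W0, W1, ..., most significant slice first."""
    slices, shift = [], sum(widths)
    for w in widths:
        shift -= w
        slices.append((addr >> shift) & ((1 << w) - 1))
    return slices

def binary_tree_lookup(banks: list[dict[int, int]], widths: list[int],
                       addr: int) -> Optional[int]:
    """Walk the cascade of Nst banks.

    banks[0] (the DAT) is addressed by W0 alone; bank i is addressed by the
    Ncpr-bit output of bank i-1 concatenated with Wi.
    """
    slices = split_address(addr, widths)
    ptr = banks[0].get(slices[0])
    for i in range(1, len(banks)):
        if ptr is None:                  # inactive entry: address not handled
            return None
        ptr = banks[i].get((ptr << widths[i]) | slices[i])
    return ptr

# Hypothetical tables: N = 8 split as W0 = 2, W1 = 3, W2 = 3, Ncpr = 2.
widths = [2, 3, 3]
banks = [{0b10: 0}, {(0 << 3) | 0b101: 1}, {(1 << 3) | 0b011: 1}]
print(binary_tree_lookup(banks, widths, 0b10101011))  # 1
print(binary_tree_lookup(banks, widths, 0b10101010))  # None
```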

In this way each bank is Ncpr bits wide, and the number of address bits needed is

(2) $\mathrm{Add(DAT)} = W_0$

(3) $\mathrm{Add(RT}_i) = W_i + \mathrm{Ncpr}$

Because there is no gain in having different values for Add(RTi), one may set

(4) $W_i = W$, for $i = 1, \dots, \mathrm{Nst}-1$

Substituting (4) into (1):

(5) $N = W_0 + (\mathrm{Nst}-1) \cdot W$

To minimize the total memory needed, the depth of the DAT must be less than or equal to the depth of the other memories:

(6) $W_0 \le W + \mathrm{Ncpr}$

By combining (5) and (6):

(7) $W \ge (N - \mathrm{Ncpr}) / \mathrm{Nst}$

therefore, the following value is used:

(8) $W = \lceil (N - \mathrm{Ncpr}) / \mathrm{Nst} \rceil$

Equation (8) is used to size every memory in this technique, with Nst as a free parameter.
The number of clock cycles needed is

(8) $\mathrm{Nclk} = 2 \cdot \mathrm{Nst}$

The factor «2» appears because the address used to access the «next» memory bank is written in the «actual» one: one clock cycle is needed to read the «actual» bank and another to prepare the address for the «next».
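As a worked illustration of equations (5), (8) and the clock count, the Python sketch below derives W, W0 and Nclk for a chosen Nst. The function name and the rejection of a negative W0 are our assumptions. A negative W0 simply means that Nst is too large for the given N and Ncpr.

```python
def binary_tree_geometry(n: int, ncpr: int, nst: int) -> tuple[int, int, int]:
    """Return (W, W0, Nclk) for a binary tree search split into Nst stages."""
    w = -(-(n - ncpr) // nst)        # equation (8): ceil((N - Ncpr) / Nst), integer form
    w0 = n - (nst - 1) * w           # equation (5) solved for W0
    if w0 < 0:
        raise ValueError("Nst too large for this N and Ncpr: W0 would be negative")
    nclk = 2 * nst                   # two clock cycles per bank
    return w, w0, nclk

# «ATM» scenario (N = 24) with Ncpr = 12 and four stages
print(binary_tree_geometry(24, 12, 4))  # (3, 15, 8)
```

By construction W0 never exceeds W + Ncpr, so condition (6) holds for every admissible Nst.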
Table 8: Nclk as a function of Nst



This shows that the technique remains valid in any scenario. 
The performance is 

(9) $\mathrm{Msize} = (\mathrm{Ncpr}+1) \cdot 2^{W_0} + (\mathrm{Nst}-1) \cdot (\mathrm{Ncpr}+1) \cdot 2^{W+\mathrm{Ncpr}}$



[0199] Each bank is considered (Ncpr+1) bits wide because an «active» bit is needed for each address.
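Equation (9) can be evaluated for a single choice of Nst as follows. This Python sketch transcribes equation (9) as written, with (Ncpr+1)-bit words accounting for the «active» bit. It is not claimed to reproduce the Binary Tree figures of the comparison tables, which may rest on sizing details not restated in this section.

```python
def binary_tree_msize(n: int, ncpr: int, nst: int) -> int:
    """Evaluate equation (9) for one choice of Nst (result in bits)."""
    w = -(-(n - ncpr) // nst)                   # equation (8)
    w0 = n - (nst - 1) * w                      # equation (5)
    if w0 < 0:
        raise ValueError("Nst too large: W0 would be negative")
    width = ncpr + 1                            # Ncpr pointer bits plus the 'active' bit
    dat = width * 2 ** w0                       # DAT: 2**W0 words
    rts = (nst - 1) * width * 2 ** (w + ncpr)   # each RTi: 2**(W + Ncpr) words
    return dat + rts

print(binary_tree_msize(24, 2, 8))  # 696
```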
[0200] As an example of the work performed to evaluate the technique, Table 9 shows the W parameter, calculated by applying equation (7) in the «ATM» scenario.
Table 9: W as a function of Ncpr and Nst, with N=24

Tables 10 and 11 show Msize = F(N, Ncpr, Nst).

[0201] The last two columns give the best performance value, i.e. the lowest value of Msize(N, Ncpr), together with the related value of Nst.
Table 10: Msize as function of Nst and Ncpr in ATM scenario
Table 11: Msize as function of Nst and Ncpr in IP scenario



[0202] The following tables report the overall performance of the binary tree search technique, which can be applied with any Ncpr value.
For «ATM» scenario (N=24):    For «IP» scenario (N=32):

Table 12: Msize for «ATM» scenario performing Binary Tree search
Table 13: Msize for «IP» scenario performing Binary Tree search



[0203] It is evident that implementing a binary tree search algorithm is almost 10 times more burdensome, in terms of memory, than implementing a CAM technique.

3.3.5 CLUSTERED SEQUENTIAL SEARCH ALGORITHM PERFORMANCE 
[0204] Figure 3-7 shows the structure that implements the Clustered Sequential Search algorithm of the invention.
[0205] Let Cs be the size of the clusters, formerly referred to as SSLL.
[0206] For each cluster, the number of locations is 2^Cs.
[0207] Let 2^Cn be the number of clusters, formerly referred to as Ncluster. Moreover, let Cj be the j-th cluster.
[0208] The N-bit wide incoming address is split into vectors of size W0, W1, ..., W(Nst-1), obviously verifying that:

(1) $N = \sum_{i=0}^{\mathrm{Nst}-1} W_i$

These addresses are sent to Nst different memory banks, called DAT and RTi respectively. These banks may be organized in the same physical memory.

The output data of RTi, concatenated with W(i+1), is used to address RT(i+1).

The last pointer, read from RT(Nst-1), is used to address a cluster Cj within another memory, called SST. The SST stores the «active» incoming address values (i.e. the addresses handled by the structure), spread over the appropriate clusters.

Normally, the SST memory has a size different from the one that hosts the DAT and the RTi. 
The Clustered Sequential Search Algorithm has Nmem=2. 

A sequential search is performed from the first to the last location of cluster Cj. If the incoming address is equal to the datum stored in an SST location, the address of that SST location is validated as the corresponding Compressed Identifier.
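The lookup path described above (DAT, then RT1 ... RT(Nst-1), then one SST cluster, then a sequential search) can be sketched in Python. The sizes follow equations (4), (8) and (10) of this section. Everything else is an assumption of the sketch: tables are Python lists, None marks an inactive entry, and insert() uses a simple first-fit allocator in place of the patent's «maximum spread» rules, so it may block earlier than the real structure.

```python
from typing import Optional


class ClusteredSequentialSearch:
    """DAT -> RT1 ... RT(Nst-1) -> one SST cluster -> sequential search (sketch)."""

    def __init__(self, n: int, ncpr: int, cs: int) -> None:
        self.n, self.ncpr, self.cs = n, ncpr, cs
        self.cn = ncpr - cs                           # Ncpr = Cn + Cs
        self.nst = -(-(n - ncpr + cs) // cs)          # ceil((N - Ncpr + Cs) / Cs)
        self.w0 = n - (self.nst - 1) * cs             # N = W0 + (Nst - 1) * Cs
        pages, rows = 2 ** self.cn, 2 ** cs
        self.dat: list[Optional[int]] = [None] * 2 ** self.w0
        self.rt = [[[None] * rows for _ in range(pages)] for _ in range(self.nst - 1)]
        self.used_pages = [0] * (self.nst - 1)        # pages handed out in each RTi
        self.sst: list[Optional[int]] = [None] * 2 ** ncpr  # 2**Cn clusters of 2**Cs rows

    def _slices(self, addr: int) -> list[int]:
        """W0 (most significant bits) followed by Nst-1 slices of Cs bits."""
        shift = self.n - self.w0
        out = [addr >> shift]
        for _ in range(self.nst - 1):
            shift -= self.cs
            out.append((addr >> shift) & (2 ** self.cs - 1))
        return out

    def _cluster(self, c: int) -> list[Optional[int]]:
        return self.sst[c << self.cs:(c + 1) << self.cs]

    def lookup(self, addr: int) -> Optional[int]:
        """Return the Ncpr-bit compressed identifier, or None if addr is not handled."""
        s = self._slices(addr)
        ptr = self.dat[s[0]]
        for i, table in enumerate(self.rt):
            if ptr is None:
                return None
            ptr = table[ptr][s[i + 1]]
        if ptr is None:
            return None
        base = ptr << self.cs                         # cluster Cj starts at Cj * 2**Cs
        for k in range(2 ** self.cs):                 # one clock cycle per location
            if self.sst[base + k] == addr:
                return base + k                       # cluster pointer concatenated with row
        return None

    def insert(self, addr: int) -> int:
        """Add addr to the handled set and return its compressed identifier."""
        found = self.lookup(addr)
        if found is not None:
            return found
        s = self._slices(addr)
        owner, idx = self.dat, s[0]
        for level in range(self.nst - 1):             # DAT, RT1, ..., RT(Nst-2)
            if owner[idx] is None:                    # open a fresh page in RT(level+1)
                if self.used_pages[level] == 2 ** self.cn:
                    raise RuntimeError(f"RT{level + 1} has no free page")
                owner[idx] = self.used_pages[level]
                self.used_pages[level] += 1
            owner, idx = self.rt[level][owner[idx]], s[level + 1]
        if owner[idx] is None:                        # last RT entry: first cluster with room
            free = [c for c in range(2 ** self.cn) if None in self._cluster(c)]
            if not free:
                raise RuntimeError("SST full")
            owner[idx] = free[0]
        rows = self._cluster(owner[idx])
        if None not in rows:
            raise RuntimeError("cluster full: blocking condition")
        cid = (owner[idx] << self.cs) + rows.index(None)
        self.sst[cid] = addr
        return cid


css = ClusteredSequentialSearch(n=8, ncpr=3, cs=2)    # tiny hypothetical sizes
for a in (0, 1, 2, 192, 3):
    print(a, "->", css.insert(a))                     # 0->0, 1->1, 2->2, 192->3, 3->4
print(css.lookup(5))                                  # None: address not handled
```

Because every SST row stores the full N-bit address, several RT entries may safely point to the same cluster; the sequential search resolves them by exact match.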

As far as the depth of the SST is concerned, there are 2^Cn clusters, each 2^Cs deep. The overall depth of the SST is 2^Ncpr. Therefore, it may be stated that:

(2) $2^{\mathrm{Cn}} \cdot 2^{\mathrm{Cs}} = 2^{\mathrm{Ncpr}}$

and

(3) $2^{\mathrm{Cn}+\mathrm{Cs}} = 2^{\mathrm{Ncpr}}$

(4) $\mathrm{Ncpr} = \mathrm{Cn} + \mathrm{Cs}$

These relationships give the size of the SST:

(5) $\mathrm{SSTsize} = N \cdot 2^{\mathrm{Ncpr}}$

Of course, the rules stemming from the «maximum spread» approach identified in paragraph 4 must be applied to the RTi.

In RTi (i = 1, ..., Nst-1), 2^Cn «pages» are addressed by the value stored in RT(i-1). Within a «page», 2^Cs locations are needed in order to prevent a blocking condition. The stored value is Cn bits wide. This gives the size of RTi:

(6) $\mathrm{RT}_i\mathrm{size} = \mathrm{Cn} \cdot 2^{\mathrm{Ncpr}}$, with $i = 1, \dots, \mathrm{Nst}-1$

The first bank, the DAT, performs a flat addressing function between the W0 vector and the first pointer, which is Cn bits wide. To minimize the total memory needed, the depth of the DAT must be less than or equal to the depth of the other memories constituting the RTi. In fact,

(7) $W_0 \le \mathrm{Cn} + \mathrm{Cs} = \mathrm{Ncpr}$

moreover

(8) $N = W_0 + (\mathrm{Nst}-1) \cdot \mathrm{Cs}$

Combining (7) and (8) we have

(9) $\mathrm{Nst} \ge (N - \mathrm{Ncpr} + \mathrm{Cs}) / \mathrm{Cs}$

thus choosing

(10) $\mathrm{Nst} = \lceil (N - \mathrm{Ncpr} + \mathrm{Cs}) / \mathrm{Cs} \rceil$

Equation (10) is used to determine Msize, with Cs as a free parameter. The following tables show the resulting values of Nst as a function of Cs and Ncpr.



For «ATM» scenario (N=24):

Ncpr   Cs=2   Cs=3   Cs=4   Cs=5   Cs=6
   2    12*     9*     7*     6*     5*
   3    12      8*     7*     6*     5*
   4    11      8      6*     5*     5*
   5    11      8      6      5*     5*
   6    10      7      6      5      4*
   7    10      7      6      5      4
   8     9      7      5      5      4
   9     9      6      5      4      4
  10     8      6      5      4      4
  11     8      6      5      4      4
  12     7      5      4      4      3
  13     7      5      4      4      3
  14     6      5      4      3      3
  15     6      4      4      3      3
  16     5      4      3      3      3

Table 14: Nst for «ATM» scenario performing Clustered Sequential search

For «IP» scenario (N=32):

Ncpr   Cs=2   Cs=3   Cs=4   Cs=5   Cs=6
   2    16*    11*     9*     7*     6*
   3    16     11*     9*     7*     6*
   4    15     11      8*     7*     6*
   5    15     10      8      7*     6*
   6    14     10      8      7      6*
   7    14     10      8      6      6
   8    13      9      7      6      5
   9    13      9      7      6      5
  10    12      9      7      6      5
  11    12      8      7      6      5
  12    11      8      6      5      5
  13    11      8      6      5      5
  14    10      7      6      5      4
  15    10      7      6      5      4
  16     9      7      5      5      4

Table 15: Nst for «IP» scenario performing Clustered Sequential search

The values marked with * (shaded) reflect a situation where Ncpr ≤ Cs.
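Equation (10) and the W0 it implies through equation (8) can be checked in a few lines of Python. This is an illustrative sketch, with integer arithmetic used for the ceiling. By construction it always satisfies condition (7), W0 ≤ Ncpr.

```python
def css_nst(n: int, ncpr: int, cs: int) -> int:
    """Equation (10): Nst = ceil((N - Ncpr + Cs) / Cs), in integer arithmetic."""
    return -(-(n - ncpr + cs) // cs)

def css_w0(n: int, ncpr: int, cs: int) -> int:
    """DAT slice width obtained from equation (8): N = W0 + (Nst - 1) * Cs."""
    return n - (css_nst(n, ncpr, cs) - 1) * cs

print([css_nst(24, 12, cs) for cs in range(2, 7)])  # [7, 5, 4, 4, 3]
print([css_nst(32, 12, cs) for cs in range(2, 7)])  # [11, 8, 6, 5, 5]
print(all(css_w0(24, c, s) <= c for c in range(2, 17) for s in range(2, 7)))  # True
```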
The number of clock cycles needed is 
(11) $\mathrm{Nclk} = 2 \cdot \mathrm{Nst} + 2^{\mathrm{Cs}}$

The factor «2» appears because the address used to access the «next» memory bank is written in the «actual» one: one clock cycle is needed to read the «actual» bank and another to prepare the address for the «next». During the SST search only one clock cycle per address is needed.
The following tables show the values of Nclk as a function of Cs and Ncpr.



For «ATM» scenario (N=24):

Ncpr   Cs=2   Cs=3   Cs=4   Cs=5   Cs=6
   2    28     26     30     44     74
   3    28     24     30     44     74
   4    26     24     28     42     74
   5    26     24     28     42     74
   6    24     22     28     42     72
   7    24     22     28     42     72
   8    22     22     26     42     72
   9    22     20     26     40     72
  10    20     20     26     40     72
  11    20     20     26     40     72
  12    18     18     24     40     70
  13    18     18     24     40     70
  14    16     18     24     38     70
  15    16     16     24     38     70
  16    14     16     22     38     70

Table 16: Nclk for «ATM» scenario performing Clustered Sequential search

For «IP» scenario (N=32):

Ncpr   Cs=2   Cs=3   Cs=4   Cs=5   Cs=6
   2    36     30     34     46     76
   3    36     30     34     46     76
   4    34     30     32     46     76
   5    34     28     32     46     76
   6    32     28     32     46     76
   7    32     28     32     44     76
   8    30     26     30     44     74
   9    30     26     30     44     74
  10    28     26     30     44     74
  11    28     24     30     44     74
  12    26     24     28     42     74
  13    26     24     28     42     74
  14    24     22     28     42     72
  15    24     22     28     42     72
  16    22     22     26     42     72

Table 17: Nclk for «IP» scenario performing Clustered Sequential search

The two tables show that the CSS technique can be applied only with small Cs values, 3 or 4 (that is, clusters with 8 or 16 locations).
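Equation (11) makes the trade-off explicit. The 2^Cs term of the cluster scan quickly dominates the 2·Nst term of the table cascade, which is why only Cs = 3 or 4 is attractive. A minimal Python sketch (function name illustrative) that regenerates rows of Tables 16 and 17:

```python
def css_nclk(n: int, ncpr: int, cs: int) -> int:
    """Equation (11): two cycles per DAT/RT stage plus one per cluster location."""
    nst = -(-(n - ncpr + cs) // cs)     # equation (10), integer ceiling
    return 2 * nst + 2 ** cs

for n, label in ((24, "ATM"), (32, "IP")):
    print(label, [css_nclk(n, 8, cs) for cs in range(2, 7)])
# ATM [22, 22, 26, 42, 72]
# IP [30, 26, 30, 44, 74]
```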
The performance is 

(12) $\mathrm{Msize} = \mathrm{Cn} \cdot 2^{W_0} + (\mathrm{Nst}-1) \cdot \mathrm{Cn} \cdot 2^{\mathrm{Ncpr}} + N \cdot 2^{\mathrm{Ncpr}}$

[0209] Table 18 and Table 19 show the results of the above equation, Msize = F(N, Ncpr, Cs).

[0210] The last two columns give the best performance value for Msize(N, Cs), together with the related value of Cs.

Ncpr      Cs=2      Cs=3      Cs=4      Cs=5      Cs=6   Best Msize   Best Cs
   2       96*       63*       46*     34.5*       28*         n.a.      n.a.
   3      284       192*      143*      111*       93*         n.a.      n.a.
   4      736       504       384*      304*      254*          504         3
   5     1776      1232       944       768*      639*         1232         3
   6     4096      2880      2208      1808      1536*         2880         3
   7     9152      6400      5040      4128      3520          6400         3
   8    19968     14144     11264      9264      7808         11264         4
   9    42752     30720     23808     20480     17088         23808         4
  10    90112     64000     50688     42496     37120         50688         4
  11   187392    135168    108288     89088     80192        108288         4
  12   385024    282624    229376    187904    172032        229376         4
  13   782336    565248    454656    397312    339968        454656         4
  14  1572864   1159168    925696    835584    688128        925696         4
  15  3129344   2359296   1912832   1605632   1413120       1912832         4
  16  6160384   4554752   3932160   3194880   2924544       3932160         4

Table 18: Msize as function of Cs and Ncpr in ATM scenario
Values marked with * correspond to Ncpr ≤ Cs.

Ncpr      Cs=2      Cs=3      Cs=4      Cs=5      Cs=6   Best Msize   Best Cs
   2      128*       84*       62*       44*       32*         n.a.      n.a.
   3      380       256*      191*      152*      124*         n.a.      n.a.
   4      992       676       512*      412*      344*          676         3
   5     2416      1664      1264      1024*      860*         1664         3
   6     5632      3872      2976      2436      2048*         3872         3
   7    12736      8832      6832      5632      4740          8832         3
   8    28160     19712     15360     12416     10752         15360         4
   9    61184     42496     33024     27136     23296         33024         4
  10   131072     91904     71168     59008     50176         71168         4
  11   277504    196608    153344    127744    107776        153344         4
  12   581632    407552    327680    274432    230912        327680         4
  13  1208320    856064    667648    557056    493312        667648         4
  14  2490368   1785856   1384448   1150976   1048576       1384448         4
  15  5095424   3604480   2895872   2400256   2080768       2895872         4
  16 10354688   7421952   6029312   5025792   4227072       6029312         4

Table 19: Msize as function of Cs and Ncpr in IP scenario
Values marked with * correspond to Ncpr ≤ Cs.
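Equation (12) is the sum of the DAT, the Nst-1 routing tables (equation (6)) and the SST (equation (5)). The Python sketch below (function name illustrative) regenerates entries of Tables 18 and 19. It returns a float because, in the Ncpr ≤ Cs cases, W0 can become negative and Cn can become zero or negative. Those entries are formal values of the equation rather than realizable configurations.

```python
def css_msize(n: int, ncpr: int, cs: int) -> float:
    """Equation (12): DAT + (Nst-1) routing tables + SST, in bits."""
    cn = ncpr - cs                       # equation (4)
    nst = -(-(n - ncpr + cs) // cs)      # equation (10)
    w0 = n - (nst - 1) * cs              # equation (8)
    dat = cn * 2.0 ** w0                 # 2**W0 pointers of Cn bits
    rts = (nst - 1) * cn * 2 ** ncpr     # equation (6) for each RTi
    sst = n * 2 ** ncpr                  # equation (5)
    return dat + rts + sst

print(css_msize(24, 12, 4))  # 229376.0
```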



[0211] These are the overall performance values of the Clustered Sequential Search Algorithm of the invention.
[0212] Two physical memories are needed (Nmem=2): the first hosts the DAT and the plurality of RTi banks, the second the SST.

[0213] The method of the invention can be implemented with different Ncpr values. The method remains valid as long as Ncpr is not extremely small (2, 3).

[0214] In the range Ncpr ∈ [8, 16] the chosen cluster size parameter Cs is 4 (that is, clusters with 16 positions).
[0215] The practicable conditions are summarized in the following tables.
For «ATM» scenario (N=24):

Ncpr   Nclk   Msize (bits)   Cs
   2    16        n.a.       n.a.
   3    16        n.a.       n.a.
   4    24         504        3
   5    24        1232        3
   6    22        2880        3
   7    26        6400        3
   8    26       11264        4
   9    26       23808        4
  10    26       50688        4
  11    26      108288        4
  12    24      229376        4
  13    24      454656        4
  14    24      925696        4
  15    24     1912832        4
  16    22     3932160        4

Table 20: Msize for «ATM» scenario performing Clustered Sequential search

For «IP» scenario (N=32):

Ncpr   Nclk   Msize (bits)   Cs
   2    16        n.a.       n.a.
   3    16        n.a.       n.a.
   4    16         676        3
   5    14        1664        3
   6    14        3872        3
   7    14        8832        3
   8    16       15360        4
   9    16       33024        4
  10    16       71168        4
  11    14      153344        4
  12    14      327680        4
  13    14      667648        4
  14    12     1384448        4
  15    12     2895872        4
  16    16     6029312        4

Table 21: Msize for «IP» scenario performing Clustered Sequential search
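The Msize column of Tables 20 and 21 can be regenerated from equation (12) once a Cs is fixed for each Ncpr. The selection rule in this Python sketch is an assumption read off Tables 18 to 21 and [0214]: no practicable configuration for Ncpr ≤ 3, Cs = 3 for Ncpr = 4 to 7, and Cs = 4 from Ncpr = 8 to 16. Integer arithmetic suffices here because Cs < Ncpr keeps Cn and W0 positive.

```python
from typing import Optional

def css_msize(n: int, ncpr: int, cs: int) -> int:
    """Equation (12) for Cs < Ncpr (so Cn > 0 and W0 > 0), in bits."""
    cn = ncpr - cs
    nst = -(-(n - ncpr + cs) // cs)
    w0 = n - (nst - 1) * cs
    return cn * 2 ** w0 + (nst - 1) * cn * 2 ** ncpr + n * 2 ** ncpr

def practicable_cs(ncpr: int) -> Optional[int]:
    """Hypothetical selection rule: none for Ncpr <= 3, Cs = 3 up to 7, then Cs = 4."""
    if ncpr <= 3:
        return None
    return 3 if ncpr < 8 else 4

for n, label in ((24, "ATM"), (32, "IP")):
    row = []
    for ncpr in range(2, 17):
        cs = practicable_cs(ncpr)
        row.append(None if cs is None else css_msize(n, ncpr, cs))
    print(label, row)
```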



[0216] Of course, by performing the sequential search on different SSMs in parallel (as in an Extended Sequential Search), it is possible to accept larger Cs values without increasing Nclk. In an embodiment of that kind, Msize can be reduced further.
3.3.6 OVERALL PERFORMANCE COMPARISON

[0217] Table 22 shows a comparison between the various known techniques and the CSSA technique of the invention for the «ATM» scenario; Table 23 shows the same comparison for the «IP» scenario.
Ncpr       CAM   Pure Sequential   Extended Sequential   Binary Tree       CSS
   2        96          96                  96                792        n.a.
   3       192         192                 192               2048        n.a.
   4       384         384                 384               4800         504
   5       768        n.a.                 768              11136        1232
   6      1536        n.a.                1536              25088        2880
   7      3072        n.a.                3072              53248        6400
   8      6144        n.a.                6144              82944       11264
   9     12288        n.a.                n.a.             174080       23808
  10     24576        n.a.                n.a.             360448       50688
  11     49152        n.a.                n.a.             737280      108288
  12     98304        n.a.                n.a.            1490944      229376
  13    196608        n.a.                n.a.            2981888      454656
  14    393216        n.a.                n.a.            5898240      925696
  15    786432        n.a.                n.a.           11534336     1912832
  16   1572864        n.a.                n.a.           20054016     3932160

Table 22: Msize for «ATM» scenario comparison (bits)



Ncpr       CAM   Pure Sequential   Extended Sequential   Binary Tree       CSS
   2       128         128                 128               1584        n.a.
   3       256         256                 256               4160        n.a.
   4       512         512                 512              10240         676
   5      1024        1024                1024              23040        1664
   6      2048        n.a.                2048              51968        3872
   7      4096        n.a.                4096             116736        8832
   8      8192        n.a.                8192             165888       15360
   9     16384        n.a.               16384             348160       33024
  10     32768        n.a.                n.a.             743424       71168
  11     65536        n.a.                n.a.            1572864      153344
  12    131072        n.a.                n.a.            3194880      327680
  13    262144        n.a.                n.a.            6651904      667648
  14    524288        n.a.                n.a.           13762560     1384448
  15   1048576        n.a.                n.a.           27262976     2895872
  16   2097152        n.a.                n.a.           40108032     6029312

Table 23: Msize for «IP» scenario comparison (bits)



[0218] The technique that covers the entire Ncpr range with the smallest Msize is obviously the CAM, but in this case there are serious implementation problems, especially for relatively large Ncpr values; moreover, it must be remembered that Msize is given in terms of «CAM» bits, which are memory structures far more complex than ordinary «RAM» structures.

[0219] The applicability of a sequential search algorithm (pure or extended) covers only relatively small Ncpr values. On the other hand, the Msize requirement is minimal. This approach may remain a candidate in equipment that needs a limited number of channels (up to 512).

[0220] The classical Binary Tree search and the Clustered Sequential Search of the present invention appear to be the only two techniques capable of covering the entire spectrum of applications. However, the preceding tables make it clear that the memory needed to implement the CSS is far less than the memory needed for a classical Binary Tree search.



For «ATM» scenario (N=24):

Ncpr   BT/CAM   CSS/CAM   Ratio %
   2     8.25     n.a.      n.a.
   3    10.66     n.a.      n.a.
   4    12.5      1.31       954
   5    14.5      1.6        906
   6    16.33     1.87       873
   7    17.33     2.08       833
   8    13.5      1.83       737
   9    14.16     1.93       733
  10    14.66     2.06       711
  11    15        2.2        681
  12    15.16     2.33       650
  13    15.16     2.31       656
  14    15        2.35       638
  15    14.66     2.43       603
  16    12.75     2.5        510

Table 24: Binary Tree vs CSSA for «ATM» scenario comparison

For «IP» scenario (N=32):

Ncpr   BT/CAM   CSS/CAM   Ratio %
   2    12.37     n.a.      n.a.
   3    16.25     n.a.      n.a.
   4    20        1.32      1515
   5    22.5      1.62      1388
   6    25.37     1.89      1342
   7    28.5      2.16      1325
   8    20.25     1.87      1082
   9    21.25     2.01      1057
  10    22.68     2.17      1045
  11    24        2.33      1030
  12    24.37     2.5        974
  13    25.37     2.54       998
  14    26.25     2.64       994
  15    26        2.76       942
  16    19.12     2.87       666

Table 25: Binary Tree vs CSSA for «IP» scenario comparison
[0221] In the previous tables, the Msize values for the CSS and for the Binary Tree are normalized with respect to the CAM Msize. The normalized BT value is then divided by the corresponding normalized CSS value. This gives an indication of the gain obtained by applying the CSS technique instead of the BT. As can be readily appreciated, the gain ranges from nine to five times in the ATM scenario, and from 15 to six times in the IP scenario.
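The normalization of [0221] can be sketched as follows (function name illustrative). Since the CAM size appears in both numerator and denominator it cancels, so the gain is simply 100 · Msize(BT) / Msize(CSS). The tabulated percentages appear to have been computed from the rounded two-decimal ratios, so they may differ slightly from the exact value. For example, at Ncpr = 4 in the «ATM» scenario the exact value is about 952 and the table shows 954.

```python
def gain_percent(bt: float, css: float, cam: float) -> float:
    """Normalize BT and CSS Msize by the CAM Msize, then take their ratio (x100)."""
    bt_norm = bt / cam
    css_norm = css / cam
    return 100 * bt_norm / css_norm      # CAM cancels: equals 100 * bt / css

# «ATM» scenario, Ncpr = 12 (values from Table 22)
print(round(gain_percent(1490944, 229376, 98304), 1))  # 650.0
```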

[0222] This means that, for Ncpr greater than 8 or 9, the Clustered Sequential Search technique of the invention is 
the technique that gives by far the best overall performance. 

[0223] Moreover, with the CSS it is possible to markedly reduce the cost of implementation.

[0224] For example, in the ATM scenario, with Ncpr=12 (4096 entries), the memory needed for a Binary Tree is 1491 Kbits and for the CSS is 229 Kbits.

[0225] If the address compression function is implemented by way of an ASIC and the Binary Tree technique is used, it may be necessary to employ an external memory. This can be avoided by using the CSSA technique of this invention, thus reducing the pin requirement.

Claims 

1. A method of address compression for a data stream structured in packets or cells, each including a destination identifier consisting of a string of N bits (INCVECT) constituting an address space (U) of size 2^N, consisting in an algorithm, executable in a predictable time span, mapping 2^Ncpr points of said address space (U) belonging to a subset (S) of identifiers to be compressed to a unique string of Ncpr bits, constituting a compressed address space (C) of size 2^Ncpr, where Ncpr < N, characterized in that the method comprises the steps of:

a) splitting said address space (U) of incoming N-bit identifiers into a plurality of subspaces by the use of a direct addressing table (DAT), a row of which is pointed by a first predefined slice of the incoming N-bit identifier (INCVECT) for outputting a first pointer datum;

b) clustering said subset (S) of N-bit identifiers contained in said subspaces by the use of a plurality of routing tables (RTi), the tables being coupled in a cascade, a page of the first table (RT1) being pointed by said first pointer datum and the row of the pointed page being selected by a predefined second slice of the incoming N-bit identifier (INCVECT), for outputting a second pointer datum, a page of each of the following tables of the cascade being pointed by the pointer datum output by the preceding table and the row of the pointed page being selected by a respective predefined slice of the incoming N-bit identifier (INCVECT), for outputting from the last routing table (RTn) a final pointer datum, thus identifying a series of clusters located in at least a sequential search table (SST), each cluster smaller than or equal to a predefined number (SSLL) and storing only points of said subsets that belong to the said subset (S);

c) performing, on a cluster of said first subset (S), having a size equivalent to said predefined number (SSLL) as defined in step b), a sequential search in at least a table (SST), said table being organized in pages, each page corresponding to a cluster, composed by a number of rows equal to said predefined number (SSLL), within a given packet or cell time slot, by pointing, with the pointer datum outputted by the last one of said routing tables (RTn) in cascade, a page on which said sequential search is performed;

d) said pointer datum outputted by the last routing table, concatenated with the row index datum of the selected page of said sequential search table (SST) or plurality of tables, verifying the match with the incoming N-bit identifier (INCVECT), constituting said compressed address of Ncpr bits.

2. The method according to claim 1, wherein said subsets in which said address space (U) is split have identical size and all routing tables (RTi) of said plurality are organized in the same number of pages.

3. The method according to claim 1, wherein more than one sequential search table (SST) with the same number of pages is used, the same pointer datum output by the last one of said routing tables (RTi) in cascade pointing a page of each sequential search table (SSTj), and the sequential search being performed in parallel on all the selected pages of the sequential search tables (SSTj) until the content of a row of any of the sequential search tables, searched in parallel, verifies the match with said incoming N-bit identifier (INCVECT).

4. The method according to claim 1, wherein each row of the sequential search table (SST) is adapted to store a predetermined number of vectors (OUTVECT) to be matched with said incoming N-bit identifier (INCVECT).

5. The method as described in claims 1, 2, 3, 4, wherein the steps a) and b) are performed during a single cell (packet) period, and the step c) is performed during the successive cell (packet) period, in a pipelined arrangement.

6. A method of address compression for a data stream structured in packets or cells, each including a destination identifier consisting of a string of N bits (INCVECT) constituting an address space (U) of size 2^N, consisting in an algorithm, executable in a predictable time span, mapping different points or domains of said address space (U) belonging to a certain number (Nclasses) of subsets (S1, S2, ..., Sj, ..., SNclasses) of identifiers to be compressed to a number of strings, said number being equal to the said number of subsets (Nclasses), these strings constituting different compressed address spaces (C1, C2, ..., Cj, ..., CNclasses), characterized in that the method comprises the steps of:

a) splitting said address space (U) of incoming N-bit identifiers into a plurality of subspaces by the use of a direct addressing table (DAT), a row of which is pointed by a first predefined slice of the incoming N-bit identifier (INCVECT) for outputting a first pointer datum;

b) clustering said subsets (S1, S2, ..., Sj, ..., SNclasses) of N-bit identifiers contained in said subspaces by the use of a plurality of routing tables (RTij), the tables being organized in a tree, a page of the first table (RT11) being pointed by said first pointer datum and the row of the pointed page being selected by a predefined second slice of the incoming N-bit identifier (INCVECT), for outputting a second pointer datum, used to point a page of the following tables of the cascade (RT12) if the same subset Sj has to be clustered, or to point to at least two different tables (RT21 and RT22, ...), downstream of a branching of said tree, if different subsets must be clustered to different compressed address spaces, for selecting, by means of a predefined slice of the incoming N-bit identifier (INCVECT), at least two different pointer data suitable to point to a next stage of Routing Table of said tree (RTij), and so forth until all bits of the incoming N-bit identifier (INCVECT) have been utilized, outputting from the last pointed routing tables (RTnj) a plurality of final pointer data (CLID1, CLID2, ..., CLIDj, ..., CLIDNclasses) identifying as many clusters in different sequential search tables (SST1, SST2, ..., SSTj, ..., SSTNclasses), each organized in pages composed by a number of rows equal to said predefined number (SSLLj), each page corresponding to a cluster, and each sequential search table corresponding to points or domains or subsets Sj belonging to said address space (U), each cluster being smaller than or equal to said predefined number of rows (SSLL1, SSLL2, ..., SSLLj, ..., SSLLNclasses) and storing only points belonging to the said subset of N-bit identifiers (S1, S2, ..., Sj, ..., SNclasses) which map to the corresponding said (C1, C2, ..., Cj, ..., CNclasses) subsets of compressed addresses;

c) performing, on the clusters belonging to each sequential search table (SST1, SST2, ..., SSTj, ..., SSTNclasses) pointed by said final pointer data (CLID1, CLID2, ..., CLIDj, ..., CLIDNclasses), a sequential search by means of different address generators, one for each sequential search table, verifying the match of the data stored in said sequential search tables (OUTVECTj) with said incoming N-bit identifier (INCVECT), identifying said compressed address subsets (C1, C2, ..., Cj, ..., CNclasses) of compressed addresses.

7. A data processing structure for performing address compression for a data stream structured in packets or cells, each including a destination identifier consisting of a string of N bits (INCVECT) constituting an address space (U) of size 2^N, by mapping in a predictable time span 2^Ncpr points of said address space (U) belonging to a subset (S) of identifiers to be compressed to a unique string of Ncpr bits constituting a compressed address space (C) of size 2^Ncpr, where Ncpr < N, the structure receiving an incoming N-bit identifier (INCVECT) belonging to said address space (U) containing unique address information upon verifying a match of the destination information contained in the N-bit incoming identifier (INCVECT) with an outgoing N-bit vector (OUTVECT) among a plurality of 2^Ncpr elements, each one in direct relationship with a compressed address, characterized in that it comprises:

a) a direct addressing table (DAT) to which a first predefined slice of said incoming N-bit identifier (INCVECT) is inputted, pointing a row of said table for outputting a first pointer datum;

b) a cascade of routing tables (RT1, ..., RTn), the first of which is coupled in cascade to said direct addressing table (DAT), each organized in selectable pages that are pointed by the pointer datum of the preceding table in the cascade, the first table of the cascade (RT1) having a page pointed by said first pointer datum outputted by said direct addressing table (DAT), and a row of the thus pointed page of all routing tables of the cascade being pointed by respective slices of said incoming N-bit identifier (INCVECT), which is inputted to each routing table;

c) at least a sequential search table (SST) organized in a plurality of pages, or clusters, pointed by the datum outputted by the last table (RTn) of said cascade of routing tables;

d) validation means (=) verifying the coincidence of the destination information contained in said incoming N-bit identifiers (INCVECT) with the information contained in the sequentially searched rows (OUTVECT) of the pointed page of said sequential search table (SST) or plurality of tables.

8. The data processing structure of claim 7, characterized in that all said routing tables (RTi) of said cascade are organized in the same number of pages.

9. The data processing structure of claim 7, characterized in that it includes two or more sequential search tables (SST) organized in the same number of pages pointed by the same datum outputted by the last table (RTn) of said cascade of routing tables and searched simultaneously in parallel.

10. The data processing structure of claim 7, characterized in that it includes a sequential search table (SST) in which each row hosts more than one vector (OUTVECT) to be matched with said incoming N-bit identifier (INCVECT).

11. The data processing structures as described in claims 7, 8, 9, 10, characterized in that the operations performed by said direct addressing table (DAT) and the operations performed by said cascade of routing tables (RT1, ..., RTn) are executed during a single cell (packet) period, and the operations performed by said at least one sequential search table (SST) are executed during the successive cell (packet) period, said direct addressing table (DAT), said tree of routing tables (RTij) and said sequential search table (SSTj) being organized in a pipeline employing two first-in-first-out registers, two cells at a time.

12. A data processing structure for performing address compression for a data stream structured in packets or cells, each including a destination identifier consisting of a string of N bits (INCVECT) constituting an address space (U) of size 2^N, by mapping in a predictable time span different points or domains of said address space (U) belonging to Nclasses subsets (S1, S2, ..., Sj, ..., SNclasses) of identifiers to be compressed to M strings of at least Ncpr bits, constituting different compressed address spaces (C1, C2, ..., Cj, ..., CNclasses), the structure receiving an incoming N-bit identifier (INCVECT) belonging to said address space (U) containing unique address information upon verifying a match of the destination information contained in the N-bit incoming identifier (INCVECT) with Nclasses of N-bit vectors (OUTVECTj) among a plurality of elements, each one in direct relationship with a compressed address, characterized in that it comprises:

a) a direct addressing table (DAT) to which a first predefined slice of said incoming N-bit identifier (INCVECT) is inputted, pointing a row of said table for outputting a first pointer datum;

b) a tree of routing tables (RT11, ..., RTij, ..., RTnNclasses), the first of which is coupled in cascade to said direct addressing table (DAT), each routing table being organized in selectable pages that are pointed by pointer data outputted by preceding tables in the tree-like cascade, the first table of the tree-like cascade (RT11) having a page pointed by said first pointer datum outputted by said direct addressing table (DAT), and pointing to a chain of routing tables or branching to at least two chains, a row of the thus pointed page of all routing tables of the tree-like cascade being pointed by respective slices of said incoming N-bit identifier (INCVECT), which is inputted to each routing table, the last routing table (RTnj) of each branch in said tree-like cascade generating a final pointer datum (CLID1, CLID2, ..., CLIDj, ..., CLIDNclasses);

c) at least Nclasses sequential search tables (SSTj), each table being organized in a plurality of pages, or clusters, pointed by the datum (CLIDj) outputted by the last table of said tree-like cascade of routing tables;

d) validation means (=) verifying the coincidence of the destination information contained in said incoming N-bit identifier (INCVECT) with the data (OUTVECTj) contained in the pages, belonging to said sequential search table (SSTj), pointed by final pointer data (CLID1, CLID2, ..., CLIDj, ..., CLIDNclasses) generated by said last routing table (RTnj) of each branch in said tree-like cascade.